Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

Table of Contents

Exercise #4: Bowtie2 global alignment - Vibrio cholerae RNA-seq

While we have focused on aligning eukaryotic data, the same tools can be used with prokaryotic data. The major differences are less about the underlying data and much more about the external/public databases that store and distribute reference data. If we want to study a prokaryote, the reference data is usually downloaded from a resource like GenBank.  

While the alignment procedure for prokaryotes is broadly analogous, the reference preparation process is somewhat different, and will involve use of a biologically-oriented scripting library called BioPerl.  In this exercise, we will use some RNA-seq data from Vibrio cholerae, published last year on GEO here, and align it to a reference genome.

Overview of Vibrio cholerae alignment workflow with Bowtie2

Alignment of this prokaryotic data follows the workflow below. Here we will concentrate on steps 1 and 2.

  1. Prepare the vibCho reference index for bowtie2 from a GenBank record using BioPerl
  2. Align reads using bowtie2, producing a SAM file
  3. Convert the SAM file to a BAM file (samtools view) 
  4. Sort the BAM file by genomic location (samtools sort)
  5. Index the BAM file (samtools index)
  6. Gather simple alignment statistics (samtools flagstat and samtools idxstat)

Obtaining the GenBank record(s)

First prepare a directory to work in, and change to it:

...

  1. Navigate to http://www.ncbi.nlm.nih.gov/nuccore/NC_012582
    • click on the Send to down arrow (top right of page)
    • select Complete Record
    • select Clipboard as Destination
    • click Add to Clipboard
  2. Perform these steps in your Terminal window
  3. Repeat steps 1 and 2 fot the 2nd chromosome
  4. Combine the 2 files into one using cat
    • cat NC_012582  NC_012583 > vibCho.gbk

Converting GenBank records into sequence (FASTA) and annotation (GFF) files

As noted earlier, many microbial genomes are available through repositories like GenBank that use specific file format conventions for storage and distribution of genome sequence and annotations. The GenBank file format is a text file that can be parsed to yield other files that are compatible with the pipelines we have been implementing.

...

After the header lines, each feature in the genome is represented by a line that gives chromosome, start, stop, strand, and other information.  Features are things like "mRNA," "CDS," and "EXON."  As you would expect in a prokaryotic genome it is frequently the case that the gene, mRNA, CDS, and exon annotations are identical, meaning they share coordinate information. You could parse these files further using commands like grep  and awk  to extract, say, all exons from the full file or to remove the header lines that begin with #.

Introducing bowtie2

Go ahead and load the bowtie2 module so we can examine some help pages and options. To do that, you must first load the perl module, and then the a specific version of bowtie2

...

This is the key to using bowtie2 - it allows you to control almost everything about its behavior, but that also makes it is much more challenging to use than bwa – and it's easier to screw things up too!

Building the bowtie2 vibCho index

Before the alignment, of course, we've got to build a mirbase index using bowtie2-build (go ahead and check out its options). Unlike for the aligner itself, we only need to worry about a few things here:

...

This should also go pretty fast. You can see the resulting files using ls like before.

Performing the bowtie2 alignment

Now we will go back to our scratch area to do the alignment, and set up symbolic links to the index in the work area to simplify the alignment command:

...

When the job is complete you should have a cholera_rnaseq.sam file that you can examine using whatever commands you like.

Exercise #3: Bowtie2 local alignment - Human microRNA-seq

Now we're going to switch over to a different aligner that was originally designed for very short reads and is frequently used for RNA-seq data. Accordingly, we have prepared another test microRNA-seq dataset for you to experiment with (not the same one you used cutadapt on). This data is derived from a human H1 embryonic stem cell (H1-hESC) small RNA dataset generated by the ENCODE Consortium – its about a half million reads.

However, there is a problem!  We don't know (or, well, you don't know) what the adapter structure or sequences were. So, you have a bunch of 36 base pair reads, but many of those reads will include extra sequence that can impede alignment – and we don't know where! We need an aligner that can find subsections of the read that do align, and discard (or "soft-clip") the rest away – an aligner with a local alignment mode. Bowtie2 is just such an aligner.

Overview miRNA alignment workflow with bowtie2

If the adapter structure were known, the normal workflow would be to first remove the adapter sequences with cutadapt. Since we can't do that, we will instead perform a local alignment of the single-end miRNA sequences using bowtie2. This workflow has the following steps:

...

These are the four reference genomes we will be using today, with some information about them (and here is information about many more genomes):


Building the bowtie2 mirbase index

Before the alignment, of course, we've got to build a mirbase index using bowtie2-build (go ahead and check out its options). Unlike for the aligner itself, we only need to worry about a few things here:

...

Code Block
titlebowtie2 index files for miRNAs
hairpin_cDNA_hsa.fa
hairpin_cDNA_hsa.fa.1.bt2
hairpin_cDNA_hsa.fa.2.bt2
hairpin_cDNA_hsa.fa.3.bt2
hairpin_cDNA_hsa.fa.4.bt2
hairpin_cDNA_hsa.fa.rev.1.bt2
hairpin_cDNA_hsa.fa.rev.2.bt2

Performing the bowtie2 local alignment

Now, we're ready to actually try to do the alignment.  Remember, unlike BWA, we actually need to set some options depending on what we're after. Some of the important options for bowtie2 are:

...

Expand
titleSolution

We can use our cut / grep trick from Exercise #1, but on the human_mirnaseq.sam file. Since all read names in this file start with TUPAC, we'll use that pattern to select non-header lines.

Code Block
languagebash
titleCount aligned SAM records
grep -P -v '^@' human_mirnaseq.sam | cut -f 3 | grep -v '*' | wc -l

This expressions returns 221086.

Use sort and uniq to create a histogram of mapping qualities

The mapping quality score is in field 5 of the human_mirnaseq.sam file. We can do this to pull out only that field:

...

Expand
titleAnswer

We are looking at mapping quality values for both aligned and un-aligned records, but mapping quality only makes sense for aligned reads. This expression does not distinguish between mapping quality = 0 because the read mapped to multiple locations, and mapping quality = 0 because the sequence did not align.

The proper solution will await the use of samtools to filter out unmapped reads.

Exercise #4: BWA-MEM - Human mRNA-seq

After bowtie2 came out with a local alignment option, it wasn't long before bwa developed its own local alignment algorithm called BWA-MEM (for Maximal Exact Matches), implemented by the bwa mem command. bwa mem has the following advantages:

...

Expand
titleHint:
Code Block
languagebash
cd $SCRATCH/core_ngs/alignment
ls fastq
gunzip -c fastq/human_rnaseq.fastq.gz | echo $((`wc -l`/4))

A word about real splice-aware aligners

Using BWA mem for RNA-seq alignment is sort of a "poor man's" RNA-seq alignment method. Real splice-aware aligners like tophat2 or star have more complex algorithms (as shown below) – and take a lot more time!

RNA-seq alignment with bwa aln

Now, try aligning it with bwa aln like we did in Example #1, but first link to the hg19 bwa index directory.  In this case, due to the size of the hg19 index, we are linking to Anna's scratch area INSTEAD of our own work area containing indexes that we built ourselves.

...

Expand
titleAnswer

Because this file was generated exclusively from reads in a larger dataset that cross at least one splice junction. The sequences as they exists in most of the reads do not correspond to a single location in the genome. However subsections of each read do exist somewhere in the genome.

So, we need an aligner that is capable aligning different parts of the read to different genomic loci.

RNA-seq alignment with bwa mem

Exercise: use bwa mem to align the same data

...

Tip

Be aware that some downstream tools (for example the Picard suite, often used before SNP calling) do not like it when a read name appears more than once in the SAM file. To mark the extra alignment records as secondary, specify the bwa mem -M option. This option leaves the best (longest) alignment for a read as -is but marks additional alignments for the read as secondary (the 0x100 BAM flag). This designation also allows you to easily filter the secondary reads with samtools if desired.

BWA-MEM vs Tophat

Another approach to aligning long RNA-seq data is to use an aligner that is more explicitly concerned with sensitivity to splice sites, namely a program like Tophat. Tophat uses either bowtie (tophat) or bowtie2 (tophat2) as the actual aligner, but performs the following steps:

...