The first step in nearly every next-gen sequence analysis pipeline is to map sequencing reads to a reference genome. In this tutorial we'll run some common mapping tools on TACC.

The world of read mappers seems to be settling down a bit after being a bioinformatics Wild West where there was a new gun in town every week that promised to be a faster and more accurate shot than the current record holder. Things seem to have reached the point where there is mainly a trade-off between speed, accuracy, and configurability among read mappers that have remained popular.

There are over 50 read mapping programs listed here. We're going to (mainly) stick to BWA in this class.

Each mapper has its own set of limitations (on the lengths of reads it accepts, on how it outputs read alignments, on how many mismatches there can be, on whether it produces gapped alignments, on whether it supports SOLiD colorspace data, etc.).

Mapping tools summary

Tool	TACC	Version	Download	Manual
BWA	`module load bwa/0.6.2`	0.6.1; 0.6.2	link	link

See if you can find some other modules on Lonestar that pertain to Alignment


login1$ module key Alignment
----------------------------------------------------------------------------
The following modules match your search criteria: "Alignment"
----------------------------------------------------------------------------

  beast: beast/1.7, beast/1.7.2
    Tool for Bayesian MCMC analysis of molecular sequences

  bismark: bismark/0.7.2, bismark/0.7.4
    bismark - A tool to map bisulfite converted sequence reads and
    determine cytosine

  blast: blast/2.2.26
    NCBI BLAST+ sequence alignment package. The program compares nucleotide
    or protein sequences to sequence databases and calculates the
    statistical significance of matches.

  bowtie: bowtie/0.12.8, bowtie/2.0.0b6
    Ultrafast, memory-efficient short read aligner

  bwa: bwa/0.6.1, bwa/0.6.2
    bwa - Burrows-Wheeler Alignment Tool

  clustalw2: clustalw2/2.1
    Tool for multiple sequence alignment

  gsnap: gsnap/20120320, gsnap/20120620
    gsnap - Genomic Short-read Nucleotide Alignment Program

  mafft: mafft/6.864, mafft/6.903
    Multiple alignment program for amino acid or nucleotide sequences

  mummer: mummer/3.23
    MUMmer - A modular system for the rapid whole genome alignment of
    finished or draft sequence

  plink: plink/1.07
    plink - Whole genome association analysis toolset

  tophat: tophat/1.4.1, tophat/2.0.4
    TopHat is a fast splice junction mapper for RNA-Seq reads. It aligns
    RNA-Seq reads to mammalian-sized genomes using the ultra
    high-throughput short read aligner Bowtie, and then analyzes the
    mapping results to identify splice junctions between exons.

Obviously, the word Alignment pops up the secription of a few non-NGS modules, but you can see the utility in modules' keyword search capabilities.

Example: E. coli genome re-sequencing data

Before We Start

In order to save a lot of typing, and to allow us some flexibility in designing these courses, we will establish a UNIX shell variable BI to point to the current filesystem location of the BioITeam directory. For any shell you open that accesses Lonestar during today's tutorial, please enter the following command:


export BI=/corral-repl/utexas/BioITeam

We also have some handy scripts for you to use, but we need to add them to your path for convenience:


export PATH=$PATH:$BI/bin

Note: to see what that did, you can type

echo $PATH

and you should see the new BioITeam folder at the end. The PATH variable contains all the places Linux looks for all the commands you type.

Finally, we will use bioperl in our scripts, so let's load that into our environment:

module load perl bioperl

Data

The data files for this example are in the path:


$BI/ngs_course/intro_to_mapping/data

File Name	Description	Sample
`SRR030257_1.fastq`	Paired-end Illumina, First of pair, FASTQ format	Re-sequenced E. coli genome
`SRR030257_2.fastq`	Paired-end Illumina, Second of pair, FASTQ format	Re-sequenced E. coli genome
`NC_012967.1.gbk`	Reference Genome in Genbank format	E. coli B strain REL606

The easiest way to run the tutorial is to copy this entire directory to your $SCRATCH space and then run all of the commands from inside that directory. See if you can figure out how to do that. When you're in the right place, you should get output like this from the ls command.


login1$ ls
NC_012967.1.gbk  SRR030257_1.fastq  SRR030257_2.fastq


cds
cp -r $BI/ngs_course/intro_to_mapping/data intro_to_mapping
cd intro_to_mapping

Exercises

What basic linux commands could help us quickly peek at the files that were just copied to get an idea of their contents?

How many sequences are in the file SRR030257_1.fastq?

How many bases long are the reads in SRR030257_1.fastq?

Converting sequence file formats

Occasionally you might download a sequence or have it emailed to you by a collaborator in one format, and then the program that you want to use demands that it be in another format. Why do they have to be so picky?

The bp_seqconvert.pl script that is installed as part of Bioperl is one helpful utility for converting between many common sequence formats. On TACC, the Bioperl modules are installed, but the helper script isn't. So, we've put it in a place that you can run it from for your convenience. However, remember that any time that you use the script you must have the bioperl module loaded. We did this earlier.

Run the script without any arguments to get the help message:


bp_seqconvert.pl

Exercises

The file NC_012967.1.gbk is in Genbank format. The files SRR030257_*.fastq are in FASTQ format.

Convert NC_012967.1.gbk to FASTA format. Call the output NC_012967.1.fasta.
- Does EMBL format have sequence features (like genes) annotated?
- What information was lost by this conversion?

Extra reading

http://en.wikipedia.org/wiki/FASTQ_format

http://en.wikipedia.org/wiki/GenBank

The first portion of a Genbank file contains information about "features" of the genome, like genes. The second part contains the actual bases of the reference sequence. Therefore, a Genbank essentially has an embedded FASTA file inside it. There are also a lot of nice statistics and metadata, like the size of the sequence and its base composition in the GenBank header.

Mapping with BWA

BWA (the Burrows-Wheeler Aligner) is a fast mapping program. It's the successor to another aligner you might have used or heard of called MAQ (Mapping and Assembly with Quality).

To use BWA, we first need to load the software into our environment using TACC's module system.

Load the module:


module load bwa

Mapping tutorial BME

Mapping tools summary

Example: E. coli genome re-sequencing data

Before We Start

Data

Exercises

Converting sequence file formats

Exercises

Extra reading

Mapping with BWA