Mapping tutorial
Overview
The first step in nearly every next-gen sequence analysis pipeline is to map sequencing reads to a reference genome. In this tutorial we'll run some common mapping tools on TACC.
The world of read mappers seems to be settling down a bit after being a bioinformatics Wild West where there was a new gun in town every week that promised to be a faster and more accurate shot than the current record holder. Among the read mappers that have remained popular, the choice now mainly comes down to a trade-off between speed, accuracy, and configurability.
There are over 50 read mapping programs listed here. We're going to (mainly) stick to just two or three in this course.
Each mapper has its own set of limitations (on the lengths of reads it accepts, on how it outputs read alignments, on how many mismatches there can be, on whether it produces gapped alignments, on whether it supports SOLiD colorspace data, etc.).
Learning Objectives
This tutorial covers the commands necessary to use several common read mapping programs.
- Become comfortable with the basic steps of indexing a reference genome, mapping reads, and converting output to SAM/BAM format for downstream analysis.
- Use bowtie, bwa, and bowtie2 on an E. coli Illumina data set.
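The index, map, convert sequence named in the first objective can be sketched as a small shell function. This is only a minimal sketch that uses bowtie2 as the example mapper. The function names, the file names in the usage comment, and the use of samtools for the SAM-to-BAM conversion are assumptions made for illustration, not commands taken from this tutorial. Setting DRY_RUN=1 prints the commands without running them.

```shell
# Print a command, then run it unless DRY_RUN is set to 1.
run() {
    echo "+ $*"
    if [ "${DRY_RUN:-0}" != "1" ]; then
        "$@"
    fi
}

# Hypothetical helper: index a reference, map paired-end reads, convert to BAM.
# Arguments: reference FASTA, first-of-pair FASTQ, second-of-pair FASTQ, output prefix.
map_with_bowtie2() {
    local ref_fasta=$1 reads_1=$2 reads_2=$3 prefix=$4

    # 1. Build the bowtie2 index files from the reference sequence.
    run bowtie2-build "$ref_fasta" "$prefix"

    # 2. Map the paired-end reads; -S names the SAM output file.
    run bowtie2 -x "$prefix" -1 "$reads_1" -2 "$reads_2" -S "${prefix}.sam"

    # 3. Convert SAM to BAM (-b: BAM output, -S: SAM input, needed by older samtools).
    run samtools view -b -S -o "${prefix}.bam" "${prefix}.sam"
}

# Example with hypothetical file names:
#   DRY_RUN=1 map_with_bowtie2 REL606.fna reads_1.fastq reads_2.fastq REL606
```

bowtie and bwa follow the same overall pattern with their own index builders (bowtie-build and bwa index), although their mapping commands take different options.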
Theory
Please see the Introduction to mapping presentation for more details of the theory behind read mapping algorithms and critical considerations for using these tools correctly.
Mapping tools summary
The three tools covered in detail in this tutorial, with the versions currently available on the Lonestar cluster at TACC:
Tool | TACC | Version | Download | Manual | Example |
---|---|---|---|---|---|
Bowtie | module load bowtie/0.12.8 | 0.12.8 | | | |
BWA | module load bwa/0.6.2 | 0.6.1; 0.6.2 | | | |
Bowtie2 | module load bowtie/2.0.2 | 2.0.2 | | | |
Modules also currently exist for SHRiMP and SOAP.
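After loading a module from the table, it can be useful to confirm that the expected executables are actually on your PATH. The helper below is a hypothetical sketch (check_tools is not a TACC command). It relies only on the standard shell built-in command -v.

```shell
# Report whether each named executable can be found on PATH.
# Returns non-zero if at least one is missing.
check_tools() {
    local missing=0 tool
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found:   $tool"
        else
            echo "missing: $tool"
            missing=1
        fi
    done
    return "$missing"
}

# Example, after running e.g. "module load bwa/0.6.2":
#   check_tools bwa
```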
Example: E. coli genome re-sequencing data
The following DNA sequencing read data files were downloaded from the NCBI Sequence Read Archive via the corresponding European Nucleotide Archive record. They come from Illumina Genome Analyzer sequencing of a paired-end library from a (haploid) E. coli clone. The clone was isolated from a population of bacteria that had evolved for 20,000 generations in the laboratory as part of a long-term evolution experiment (Barrick et al., 2009). The reference genome is the ancestor of this E. coli population (strain REL606), so we expect the reads to differ from the reference at sites corresponding to mutations that arose during the evolution experiment.
Data
We have already downloaded data files for this example and put them in the path:
$BI/ngs_course/intro_to_mapping/data
File Name | Description | Sample |
---|---|---|
| | Paired-end Illumina, first of pair, FASTQ format | Re-sequenced E. coli genome |
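Before mapping paired-end data, a quick sanity check is to confirm that both FASTQ files of a pair contain the same number of reads. The sketch below assumes uncompressed FASTQ files with exactly four lines per record. The function names and the file names in the usage comment are hypothetical and are not the actual names of the files in the data directory.

```shell
# Count reads in an uncompressed FASTQ file, assuming 4 lines per record.
count_fastq_reads() {
    awk 'END { print NR / 4 }' "$1"
}

# Compare the read counts of the two files of a paired-end data set.
check_pairs() {
    local n1 n2
    n1=$(count_fastq_reads "$1")
    n2=$(count_fastq_reads "$2")
    if [ "$n1" = "$n2" ]; then
        echo "OK: $n1 read pairs"
    else
        echo "MISMATCH: $n1 vs $n2 reads"
        return 1
    fi
}

# Example with hypothetical file names:
#   cd $BI/ngs_course/intro_to_mapping/data
#   check_pairs reads_1.fastq reads_2.fastq
```

A count that is not a whole number suggests that the file is truncated or does not follow the four-line layout.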