Mapping tutorial (bowtie2, bwa) (GVA14)

Overview

The first step in nearly every next-gen sequence analysis pipeline is to map sequencing reads to a reference genome. In this tutorial we'll run some common mapping tools on TACC.

The world of read mappers seems to be settling down a bit after being a bioinformatics Wild West where there was a new gun in town every week that promised to be a faster and more accurate shot than the current record holder. Things seem to have reached the point where there is mainly a trade-off between speed, accuracy, and configurability among read mappers that have remained popular.

There are over 50 read mapping programs listed here. We're going to (mainly) stick to just two or three in this course.

Each mapper has its own set of limitations (on the lengths of reads it accepts, on how it outputs read alignments, on how many mismatches there can be, on whether it produces gapped alignments, on whether it supports SOLiD colorspace data, etc.).

Learning Objectives

This tutorial covers the commands necessary to use several common read mapping programs.

Become comfortable with the basic steps of indexing a reference genome, mapping reads, and converting output to SAM/BAM format for downstream analysis.
Use bowtie2 and BWA to map reads from an E. coli Illumina data set to a reference genome and compare the output.

Other read mappers

A previous version of this tutorial can be used to see the commands for using bowtie.

Theory

Please see the for more details of the theory behind read mapping algorithms and critical considerations for using these tools correctly.

1 Overview
2 Learning Objectives
3 Theory
4 Table of Contents
5 Mapping tools summary
6 Example: E. coli genome re-sequencing data
- 6.1 Data
  - 6.1.1 Exercises
- 6.2 Converting sequence file formats
  - 6.2.1 Exercises
  - 6.2.2 Extra reading
7 Mapping with bowtie2
- 7.1 More reading about SAM files
- 7.2 Multithreaded execution
8
9 Mapping with BWA
10 Exercises
11 From here...

Mapping tools summary

The three tools that we show detailed instructions for in this tutorial and their versions currently available on the Lonestar cluster at TACC:

Tool	TACC	Version	Download	Manual	Example
Bowtie2	`module load bowtie/2.0.2`	2.0.2	link	link	#Bowtie2
BWA	`module load bwa/0.6.2`	0.6.1; 0.6.2	link	link	#BWA

Modules also exist at the current time for: bowtie and SHRiMP.

Example: E. coli genome re-sequencing data

The following DNA sequencing read data files were downloaded from the NCBI Sequence Read Archive via the corresponding European Nucleotide Archive record. They are Illumina Genome Analyzer sequencing of a paired-end library from a (haploid) E. coli clone that was isolated from a population of bacteria that had evolved for 20,000 generations in the laboratory as part of a long-term evolution experiment (Barrick et al, 2009). The reference genome is the ancestor of this E. coli population (strain REL606), so we expect the read sample to have differences from this reference that correspond to mutations that arose during the evolution experiment.

Data

We have already downloaded data files for this example and put them in the path:

$BI/gva_course/mapping/data

File Name	Description	Sample
`SRR030257_1.fastq`	Paired-end Illumina, First of pair, FASTQ format	Re-sequenced E. coli genome
`SRR030257_2.fastq`	Paired-end Illumina, Second of pair, FASTQ format	Re-sequenced E. coli genome
`NC_012967.1.gbk`	Reference Genome in Genbank format	E. coli B strain REL606

The easiest way to run the tutorial is to copy this entire directory to your $SCRATCH space and then run all of the commands from inside that directory. See if you can figure out how to do that. When you're in the right place, you should get output like this from the ls command.

login1$ ls
NC_012967.1.gbk  SRR030257_1.fastq  SRR030257_2.fastq

cds
cp -r $BI/gva_course/mapping/data mapping
cd mapping

Exercises

What basic linux commands could help us quickly peek at the files that were just copied to get an idea of their contents?

How many sequences are in the file SRR030257_1.fastq?

How many bases long are the reads in SRR030257_1.fastq?

Converting sequence file formats

Occasionally you might download a sequence or have it emailed to you by a collaborator in one format, and then the program that you want to use demands that it be in another format. Why do they have to be so picky?

The bp_seqconvert.pl script that is installed as part of Bioperl is one helpful utility for converting between many common sequence formats. On TACC, the Bioperl modules are installed, but the helper script isn't. So, we've put it in a place that you can run it from for your convenience. However, remember that any time that you use the script you must have the bioperl module loaded. We also took care of this for you when we edited your ~/.profile_user file in the Linux introduction.

Run the script without any arguments to get the help message:

bp_seqconvert.pl

Exercises

The file NC_012967.1.gbk is in Genbank format. The files SRR030257_*.fastq are in FASTQ format.

Convert NC_012967.1.gbk to EMBL format. Call the output NC_012967.1.embl.
- Does EMBL format have sequence features (like genes) annotated?
Convert only the first 10,000 lines of SRR030257_1.fastq to FASTA format.
- What information was lost by this conversion?

Extra reading

http://en.wikipedia.org/wiki/FASTQ_format

http://en.wikipedia.org/wiki/GenBank

The first portion of a Genbank file contains information about "features" of the genome, like genes. The second part contains the actual bases of the reference sequence. Therefore, a Genbank essentially has an embedded FASTA file inside it. There are also a lot of nice statistics and metadata, like the size of the sequence and its base composition in the GenBank header.

Mapping tutorial (bowtie2, bwa) (GVA14)

Overview

Learning Objectives

Theory

Table of Contents

Mapping tools summary

Example: E. coli genome re-sequencing data

Data

Exercises

Converting sequence file formats

Exercises

Extra reading

Mapping with bowtie2