Read mapping with BWA and BOWTIE

Before We Start

In order to save a lot of typing, and to allow us some flexibility in designing these courses, we will establish a UNIX shell variable BASE to point to the current filesystem location of the TACC NGS tutorial material. For any shell you open that accesses Lonestar during today's tutorial, please enter the following command:

export BASE=/corral-repl/utexas/BioITeam/tacc_ngs

Read Mapping Tutorial

Your objective today is to map the paired-end sequence data in these files to a reference sequence, then convert the results to a sorted, indexed BAM file. You can use BWA or Bowtie to do the alignment, and SAMtools facilitate the other tasks. In the latter half of today's work, you will use these results files to call some variants in the human genome.

Data Files
$BASE/human_variation/allseqs_R1.fastq
$BASE/human_variation/allseqs_R2.fastq
The Reference Sequence
$BASE/human_variation/ref/hs37d5.fa

BWA and Bowtie (as well as most other throughput-oriented aligners) require that you provide not just the reference sequence, but an index of that sequence as well. These can be generated by bwa index or bowtie-build, respectively. To save time, we have provided indices formatted for BWA and Bowtie.

Look in the $BASE/human_variation/ref to see additional index files besides hs37d5.fa

 Solution
login1$ ls $BASE/human_variation/ref/
hs37d5.fa	  hs37d5.fa.3.ebwt  hs37d5.fa.ann  hs37d5.fa.rev.1.ebwt
hs37d5.fa.1.ebwt  hs37d5.fa.4.ebwt  hs37d5.fa.bwt  hs37d5.fa.rev.2.ebwt
hs37d5.fa.2.ebwt  hs37d5.fa.amb     hs37d5.fa.pac  hs37d5.fa.sa

These additional files are generated by the index commands. You can index your own reference sequence very easily, but be warned it may take a few hours so be sure to do it using a batch script.

See if you can find some other modules on Lonestar that pertain to Alignment

 Solution
login1$ module key Alignment
----------------------------------------------------------------------------
The following modules match your search criteria: "Alignment"
----------------------------------------------------------------------------

  beast: beast/1.7, beast/1.7.2
    Tool for Bayesian MCMC analysis of molecular sequences

  bismark: bismark/0.7.2, bismark/0.7.4
    bismark - A tool to map bisulfite converted sequence reads and
    determine cytosine

  blast: blast/2.2.26
    NCBI BLAST+ sequence alignment package. The program compares nucleotide
    or protein sequences to sequence databases and calculates the
    statistical significance of matches.

  bowtie: bowtie/0.12.8, bowtie/2.0.0b6
    Ultrafast, memory-efficient short read aligner

  bwa: bwa/0.6.1, bwa/0.6.2
    bwa - Burrows-Wheeler Alignment Tool

  clustalw2: clustalw2/2.1
    Tool for multiple sequence alignment

  gsnap: gsnap/20120320, gsnap/20120620
    gsnap - Genomic Short-read Nucleotide Alignment Program

  mafft: mafft/6.864, mafft/6.903
    Multiple alignment program for amino acid or nucleotide sequences

  mummer: mummer/3.23
    MUMmer - A modular system for the rapid whole genome alignment of
    finished or draft sequence

  plink: plink/1.07
    plink - Whole genome association analysis toolset

  tophat: tophat/1.4.1, tophat/2.0.4
    TopHat is a fast splice junction mapper for RNA-Seq reads. It aligns
    RNA-Seq reads to mammalian-sized genomes using the ultra
    high-throughput short read aligner Bowtie, and then analyzes the
    mapping results to identify splice junctions between exons.

Obviously, the word Alignment pops up the secription of a few non-NGS modules, but you can see the utility in modules' keyword search capabilities.

Submit some alignment jobs to Lonestar

First, let's create a working directory: mkdir $WORK/bwa-align && cd $WORK/bwa-align. Now, we will demonstrate how to submit a BWA alignment job to Lonestar. It's up to you to figure out how to do the same for Bowtie. Using a text editor, create the following file. Feel free to copy and paste or copy it from $BASE/align_bwa_01.sh

 Hint
cp $BASE/align_bwa_01.sh .
align_bwa_01.sh
#!/bin/bash

#$ -V
#$ -cwd
#$ -pe 1way 12
#$ -q normal
#$ -l h_rt=01:00:00
#$ -A 20121008-NGS-ACES
#$ -m be
#$ -M vaughn@tacc.utexas.edu
#$ -N align_bwa_01

module load bwa/0.6.1

BASE="/corral-repl/utexas/BioITeam/tacc_ngs"

time bwa aln $BASE/human_variation/ref/hs37d5.fa $BASE/human_variation/allseqs_R1.fastq > r1.sai
time bwa aln $BASE/human_variation/ref/hs37d5.fa $BASE/human_variation/allseqs_R2.fastq > r2.sai

time bwa sampe $BASE/human_variation/ref/hs37d5.fa r1.sai r2.sai $BASE/human_variation/allseqs_R1.fastq $BASE/human_variation/allseqs_R2.fastq > hs37d5_allseqs_bwa.sam

Once you've edited the job file, submit it to Lonestar's queue, then check the status of the job to make sure it went in.

Submit your job file using qsub

 Solution
login1$ qsub align_bwa_01.sh
(... A large amount of text will scroll by
	followed by text resembling this ...)
Your job 773416 ("align_bwa_01") has been submitted

Check the status of your job in the queue

 Solution
login1$ qstat
job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
 773416 0.64564 align_bwa_ vaughn       r     10/03/2012 19:00:22 normal@c335-301.ls4.tacc.utexa    12

Now, figure out how to align the same sequences using Bowtie

 Hint

The specific commands for Bowtie, which you will need to work into your own job file is

module load bowtie/0.12.8
...
time bowtie --chunkmbs 256 -x -t -S $BASE/human_variation/ref/hs37d5.fa -1 $BASE/human_variation/allseqs_R1.fastq -2 $BASE/human_variation/allseqs_R2.fastq hs37d5_allseqs_bowtie.sam

Notice that Bowtie can handle paired-end mapping and conversion to SAM all in one command! If you run into trouble creating a job file for Bowtie from scratch, you can copy over ours to use as a template.

cp /corral-repl/utexas/BioITeam/tacc_ngs/align_bowtie_01.sh .

You're running two computing jobs out of the same directory. This will work, but you need to make sure that none of the files created by the two tasks have the same name or both will end up failing or reporting erroneous results. In our case, we've assured this by using two different algorithms (BWA and Bowtie) and by choosing different output names hs37d5_allseqs_bwa.sam and hs37d5_allseqs_bowtie.sam

Setting up BAM conversion as a dependency

We're going to use SAMTools to convert the results of your alignment jobs from the text version of the SAM format to the more efficient and useful binary BAM format. Along the way we'll sort the reads by chromosome position and create an index of the BAM file. "But Wait!", you say, "our alignments haven't run yet!". We're going to kill two birds with one stone: We will show you the basics of setting up SAM->sorted BAm conversion and we are also going to demonstrate how to orchestrate a simple workflow using job dependency on Lonestar. This means the downstream conversion tasks won't kick off until the alignment tasks have completed successfully.

First, let's assemble the job file we need to convert the BWA output file hs37d5_allseqs_bwa.sam from SAM to BAM. Using a text editor, create the following file. Feel free to copy/paste from the Wiki or copy it from /corral-repl/utexas/BioITeam/tacc_ngs/samtools_bwa_01.sh .

Don't submit samtools_bwa_01.sh to the Lonestar queue yet until you've been shown how to establish a job dependency, or the task will fail

samtools_bwa_01.sh
#!/bin/bash

#$ -V
#$ -cwd
#$ -pe 1way 12
#$ -q normal
#$ -l h_rt=02:00:00
#$ -A 20121008-NGS-ACES
#$ -m be
#$ -M vaughn@tacc.utexas.edu
#$ -N samtools_bwa_01

# Load the samtools module
module load samtools

# Set up a variable so we don't have to
# keep typing in all this path
BASE="/corral-repl/utexas/BioITeam/tacc_ngs"

# For readability, these are listed one command per line. If you
# want to chain them all together into a set of linearly dependent
# commands, check out the samtools_bwa_02.sh file for an example
# of using the && operator to do just that
#

# samtools has a suite of sub-commands documented here
# http://samtools.sourceforge.net/samtools.shtml
#
# 'view' is used for filtering as well as simple conversion
samtools view -S -b hs37d5_allseqs_bwa.sam > hs37d5_allseqs_bwa.bam
# 'sort' lets you sort by position or read name
samtools sort hs37d5_allseqs_bwa.bam hs37d5_allseqs_bwa.sorted
# 'index' lets you create an index of a BAM file
samtools index hs37d5_allseqs_bwa.sorted.bam

Setting up a job dependency

Check on the status of your bwa job (using qstat). It's probably still running. If it is, and you do an 'ls' on the output directory, either the 'hs37d5_allseqs_bwa.sam' file will be absent or will still be in process of being written. So, if we start running the SAMtools from above assuming that file exists and is complete, we're going to have a situation. We want to tell Lonestar "Don't start running the samtools tasks until AFTER bwa has completed". To do this, you just need to know the job-ID of the BWA task.

Find out the job-ID of your BWA task

 Solution
login1$ qstat
job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
 773416 0.64564 align_bwa_ vaughn       r     10/03/2012 19:00:22 normal@c335-301.ls4.tacc.utexa    12

The job-ID of MY bwa task is 773416. What's yours?

Establishing a job dependency is easy. Instead of just typing:

qsub samtools_bwa_01.sh

to tell Lonestar to start running SAMTools ASAP, you tell it this instead:

qsub -hold_jid 773416 samtools_bwa_01.sh

where you would replace 773416 with the job-ID of your BWA task.

Challenge Mode

  • Create a script to process your Bowtie results using SAMtools
  • Submit it as a dependency to your Bowtie alignment task
 Hints

See samtools_bowtie_01.sh in the tacc_ngs directory if you get stuck.