Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

Objectives

In this lab, you will explore a popular fast mapper called BWA. Simulated RNA-seq data will be provided to you; the data contains 75 bp paired-end reads that have been generated in silico to replicate real gene count data from Drosophila. The data simulates two biological groups with three biological replicates per group (6 samples total).  The objectives of this lab is mainly to:

 

  1. Learn how BWA works and how to use it.

Introduction

BWA (the Burrows-Wheeler Aligner) is a fast short read aligner. It's the successor to another aligner you might have used or heard of called MAQ (Mapping and Assembly with Quality). As the name suggests, it uses the burrows-wheeler transform to perform alignment in a time and memory efficient manner.

BWA Variants

BWA has three different algorithms:

  • For reads upto 100 bp long:
    • BWA-backtrack : BWA aln/samse/sampe  

  • For reads upto 1 Mbp long:
    • BWA-SW
    • BWA-MEM : Newer! Typically faster and more accurate.

Get your data

Six raw data files have been provided for all our further RNA-seq analysis:

  • c1_r1, c1_r2, c1_r3 from the first biological condition
  • c2_r1, c2_r2, and c2_r3 from the second biological condition


Code Block
titleGet set up for the exercises
cds
cd my_rnaseq_course
cp -r /corral-repl/utexas/BioITeam/rnaseq_course_2015/bwa_exercise . &
cd bwa_exercise

Run BWA

Load the module:

Code Block
module load bwa/0.7.7 

There are multiple versions of BWA on TACC, so you might want to check which one you have loaded for when you write up your awesome publication that was made possible by your analysis of next-gen sequencing data.

Expand
titleHere are some commands that could help...
Code Block
module spider bwa
module list
bwa

...

You can see the different commands available under the bwa package from the command line help:

Code Block

...

bwa

Try to figure out how to index and map from the command line help:

Code Block
bwa

...

 

Part 1. Create a index of your reference

NO NEED TO RUN THIS NOW- YOUR INDEX HAS ALREADY BEEN BUILT!

Code Block
bwa index -a bwtsw reference/genome.fa


Part 2a. Align the samples to reference using bwa aln/samse/sampe

We will be using this set of commands (with options that you should try to figure out) in this order, on each sample:

...

    
    bwa aln
    bwa samse or sampe

...

What's going on at each step?

Remember to use the option that enables multithreading, if there is one, for each BWA command.

First, run the index command (index) on the reference file. This is fast, so you can run it interactively.

BWA doesn't give you a choice of where to create your index files. It creates them in the same directory as the FASTA that you input. So copy the FASTA in your intro_to_mapping directory to your new bwa directory:

Expand
titleFor example, using this command...
Code Block
cp NC_012967.1.fasta bwa

Then, run the index command using the copied FASTA as input.

Expand
titleHere's the full command line if you can't figure it out...
Code Block
bwa index bwa/NC_012967.1.fasta

Take a look at your output directory using ls bwa to see what new files appear after indexing.

...

Let's submit the bwa aln job

Warning
titleSubmit to the TACC queue or run in an idev shell

Create a commands file and use launcher_creator.py followed by qsub.

Expand

nano commands.bwa

 

Put this in your commands file:

bwa aln -f GSM794483_C1_R1_1.sai reference/genome.fa data/GSM794483_C1_R1_1.fq

bwa aln -f GSM794483_C1_R1_2.sai reference/genome.fa data/GSM794483_C1_R1_2.fq

bwa aln -f GSM794484_C1_R2_1.sai reference/genome.fa data/GSM794484_C1_R2_1.fq

bwa aln -f GSM794484_C1_R2_2.sai reference/genome.fa data/GSM794484_C1_R2_2.fq

bwa aln -f GSM794485_C1_R3_1.sai reference/genome.fa data/GSM794485_C1_R3_1.fq

bwa aln -f GSM794485_C1_R3_2.sai reference/genome.fa data/GSM794485_C1_R3_2.fq

bwa aln -f GSM794486_C2_R1_1.sai reference/genome.fa data/GSM794486_C2_R1_1.fq

bwa aln -f GSM794486_C2_R1_2.sai reference/genome.fa data/GSM794486_C2_R1_2.fq

bwa aln -f GSM794487_C2_R2_1.sai reference/genome.fa data/GSM794487_C2_R2_1.fq

bwa aln -f GSM794487_C2_R2_2.sai reference/genome.fa data/GSM794487_C2_R2_2.fq

bwa aln -f GSM794488_C2_R3_1.sai reference/genome.fa data/GSM794488_C2_R3_1.fq

bwa aln -f GSM794488_C2_R3_2.sai reference/genome.fa data/GSM794488_C2_R3_2.fq

 

Expand
titleUse this Launcher_creator command

launcher_creator.py -n aln -t 04:00:00 -j commands.bwa -q normal -a CCBB -m "module load bwa/0.7.7" -l bwa_launcher.sge

 *.sai file is a file containing "alignment seeds" in a file format specific to BWA.  We still need to extend these seed matches into alignments of entire reads, choose the best matches, and convert the output to SAM format. Do we use sampe or samse?

Lets submit the bwa sampe job, but have it be on hold till previous job is finished.

Warning
titleSubmit to the TACC queue or run in an idev shell

Create a commands file and use launcher_creator.py followed by qsub.

Expand
titleI need some help figuring out the options...

nano commands.bwa.sampe

 

Put this in your commands file:

...

bwa

...

Why did we use -t 6 instead of -t 12 for multithreading? Both of our commands are going to go to a single node on Lonestar, so they should share the 12 available cores.

Again, take a look at your output directory using ls bwa to see what new files have appeared. What is a *.sai file? It's a file containing "alignment seeds" in a file format specific to BWA. Many programs produce this kind of "intermediate" file in their own format and then at the end have tools for converting things to a "community" format shared by many downstream programs.

We still need to extend these seed matches into alignments of entire reads, choose the best matches, and convert the output to SAM format.

...

sampe -f C1_R1.sam reference/genome.fa GSM794483_C1_R1_1.sai GSM794483_C1_R1_2.sai data/GSM794483_C1_R1_1.fq data/GSM794483_C1_R1_2.fq

bwa sampe -f C1_R2.sam reference/genome.fa GSM794484_C1_R2_1.sai GSM794484_C1_R2_2.sai data/GSM794484_C1_R2_1.fq data/GSM794484_C1_R2_2.fq

bwa sampe -f C1_R3.sam reference/genome.fa GSM794485_C1_R3_1.sai GSM794485_C1_R3_2.sai data/GSM794485_C1_R3_1.fq data/GSM794485_C1_R3_2.fq

bwa sampe -f C2_R1.sam reference/genome.fa GSM794486_C2_R1_1.sai GSM794486_C2_R1_2.sai data/GSM794486_C2_R1_1.fq data/GSM794486_C2_R1_2.fq

bwa sampe -f C2_R2.sam reference/genome.fa GSM794487_C2_R2_1.sai GSM794487_C2_R2_2.sai data/GSM794487_C2_R2_1.fq data/GSM794487_C2_R2_2.fq

bwa sampe -f C2_R3.sam reference/genome.fa GSM794488_C2_R3_1.sai GSM794488_C2_R3_2.sai data/GSM794488_C2_R3_1.fq data/GSM794488_C2_R3_2.fq


Part 2b. Align the samples to reference using bwa mem

Alternatively, lets also try running alignment using the newest and greatest, BWA MEM. Alignment is just one single step with bwa mem.

 

Warning
titleSubmit to the TACC queue or run in an idev shell

Create a commands file and use launcher_creator.py followed by qsub.

Expand
titleI need some help figuring out the options...

Put this in your commands file:

Code Block
nano commands.mem
 
bwa sampe -f bwa/SRR030257.sam bwa/NC_012967.1.fasta bwa/SRR030257_1.sai bwa/SRR030257_2.sai SRR030257_1.fastq SRR030257_2.fastq
run BWA pipeline
 mem reference/genome.fa data/GSM794483_C1_R1_1.fq data/GSM794483_C1_R1_2.fq > C1_R1.mem.sam
bwa mem reference/genome.fa data/GSM794484_C1_R2_1.fq data/GSM794484_C1_R2_2.fq > C1_R2.mem.sam
bwa mem reference/genome.fa data/GSM794485_C1_R3_1.fq data/GSM794485_C1_R3_2.fq > C1_R3.mem.sam
bwa mem reference/genome.fa data/GSM794486_C2_R1_1.fq data/GSM794486_C2_R1_2.fq > C2_R1.mem.sam
bwa mem reference/genome.fa data/GSM794487_C2_R2_1.fq data/GSM794487_C2_R2_2.fq > C2_R2.mem.sam
bwa mem reference/genome.fa data/GSM794488_C2_R3_1.fq data/GSM794488_C2_R3_2.fq > C2_R3.mem.sam

Since these will take a while to run, you can look at already generated results at: /corral-repl/utexas/BioITeam/rnaseq_course_2015/bwa_exercise/results/bwa 

 Help! I have a lots of reads and a large number of reads. Make BWA go faster!

  • Use threading option in the bwa command ( bwa -t <number of threads>)

  • Split one data file into smaller chunks and run multiple instances of bwa. Finally concatenate the output.
    • WAIT! We have a pipeline for that!
    • Look for runBWA.sh in $BI/bin  (it should be in your path)

Now that we are done mapping, lets look at how to assess mapping results.