Mapping with Tophat Exercises 2014

Get your data

Six raw data files were provided as the starting point:

c1_r1, c1_r2, c1_r3 from the first biological condition
c2_r1, c2_r2, and c2_r3 from the second biological condition

Get set up for the exercises

cds
cd my_rnaseq_course
cp -r /corral-repl/utexas/BioITeam/rnaseq_course/tophat_exercise . &
cp -r /corral-repl/utexas/BioITeam/rnaseq_course/tophat_results . & #pregenerated tophat results
#Lets get pregenerated bwa results as well for doing comparisons
cp -r /corral-repl/utexas/BioITeam/rnaseq_course/bwa_mem_results . &
 
cd tophat_exercise

Due to the size of the data and length of run time, many of the programs have already been run for this exercise. We will submit some jobs, but because we cannot wait for it to complete, we will be looking through already generated results. You will then be parsing the output, finding answers, and visualizing results (in the directory tophat_results).

Run tophat

On lonestar, to run tophat, following modules need to be loaded.

Load modules

module load boost/1.45.0
module load bowtie
module load tophat

Create reference index - this has already been done for you.

bowtie2-build reference/genome.fa reference/genome.fa

Notice the index files in the reference/ directory.

Tophat command syntax

     tophat [options] <bowtie_index_prefix> <reads1> <reads2>

Create a tophat.commands file with all your tophat commands.

Tophat commands

nano commands.tophat
tophat -p 2 -G reference/genes.gtf -o C1_R1_thout reference/genome.fa data/GSM794483_C1_R1_1.fq data/GSM794483_C1_R1_2.fq
tophat -p 2 -G reference/genes.gtf -o C1_R2_thout reference/genome.fa data/GSM794484_C1_R2_1.fq data/GSM794484_C1_R2_2.fq
tophat -p 2 -G reference/genes.gtf -o C1_R3_thout reference/genome.fa data/GSM794485_C1_R3_1.fq data/GSM794485_C1_R3_2.fq
tophat -p 2 -G reference/genes.gtf -o C2_R1_thout reference/genome.fa data/GSM794486_C2_R1_1.fq data/GSM794486_C2_R1_2.fq
tophat -p 2 -G reference/genes.gtf -o C2_R2_thout reference/genome.fa data/GSM794487_C2_R2_1.fq data/GSM794487_C2_R2_2.fq
tophat -p 2 -G reference/genes.gtf -o C2_R3_thout reference/genome.fa data/GSM794488_C2_R3_1.fq data/GSM794488_C2_R3_2.fq

Create a launcher file called tophat_launcher.sge to submit these tophat commands.

Create a launcher file called tophat_launcher.sge

module load python
launcher_creator.py -j commands.tophat -n tophat -q normal -t 08:00:00 -a CCBB -l tophat_launcher.sge

Submit your tophat job

Create a launcher file called tophat_launcher.sge

qsub tophat_launcher.sge

Examine tophat parameters

Why did we choose the tophat parameters we did and what do they mean? Here's our tophat command again:

tophat -p 2 -G reference/genes.gtf -o C1_R1_thout reference/genome.fa data/GSM794483_C1_R1_1.fq data/GSM794483_C1_R1_2.fq

Exercise 1a: Why did we use the -G flag?

Hints

Remember that you can type the command and hit enter to get help info about it.

Solution

The -G <GTF filename> to use the annotated splice junctions in the supplied GTF file

Exercise 1b: Using only annotated junctions
How would I tell tophat to only use a specified set of transcript annotation and not assemble any novel transcripts?

Solution

Specify -G <gtf filename> to have tophat use the annotated transcripts in the supplied GTF file
Also add the --no-novel-juncs option to suppress de novo junction detection

Exercise 1c: Why did we use the -p flag? And why did we choose -p 2 specifically?

As you can see there are many many other options for running tophat!

Examine tophat results

We are going to be running commands that can take a couple of minutes to run. But with 20 people running the same commands, we don't want to run these on the head node. They are too small to be submitted too. So, let's switch over to the tab with our idev session.

When following along here, please switch to your idev session for running these example commands.

Get tophat results

cd ../tophat_results

Take a minute to look at the output files produced by one tophat run.

cd C1_R1_thout
ls -l

-rw-rw---- 1 daras G-801020 331M May 16 23:35 accepted_hits.bam
-rw------- 1 daras G-801020  563 May 16 23:35 align_summary.txt
-rw------- 1 daras G-801020   52 May 16 23:35 deletions.bed
-rw------- 1 daras G-801020   54 May 16 23:35 insertions.bed
-rw------- 1 daras G-801020 2.9M May 16 23:35 junctions.bed
drwx------ 2 daras G-801020  32K May 16 23:35 logs
-rw------- 1 daras G-801020  184 May 16 23:35 prep_reads.info
-rw------- 1 daras G-801020  442 May 16 23:35 unmapped.bam

Exercise 1: Examine a BAM file

Examine a few lines of the C1_R1 alignment file.

Lets try this first

head accepted_hits.bam

Let's try this next

samtools view -x accepted_hits.bam | head  

#make sure you are in tophat_results/C1_R1_thout to run this

Exercise 2a: Spliced sequences - Find a spliced alignment.

How is a spliced sequence represented in the BAM file?

Which field in our file contains the cigar score?

Finding Cigar score field

samtools view -x accepted_hits.bam | head

Let's parse the bam file to pull out just spliced alignments.

Looking for spliced alignments

 samtools view accepted_hits.bam | cut -f 1,6 | grep 'N'|head

The CIGAR string "66M76N9M" represents a spliced sequence. The codes mean:

66M - the first 66 bases match the reference
76N - there are then 76 bases on the reference with no corresponding bases in the sequence (an intron)
9M - the last 17 bases match the reference

Exercise 2b: What is the read ID for one of these reads with spliced alignment?

Pick your favorite and remember it. I have picked read 9 with the cigar score 66M76N9M.

Exercise 3: Count spliced sequences

How many spliced sequences are there in the C1_R1 alignment file?

Hint

samtools view and cut and grep

How to

samtools view accepted_hits.bam | cut -f 6 | grep 'N' | wc -l

Exercise 4: Samtools flagstat

Lets run samtools flagstat on our tophat bam file:

samtools flagstat accepted_hits.bam

Let's see how this compares to BWA results...

Back to the directory with BWA results

cd $SCRATCH/my_rnaseq_course/bwa_mem_results

Exercise 5: Count spliced sequences in BWA results

How many spliced sequences are there in one of the alignment files, C1_R1.mem.bam?

Hint

samtools view and cut and grep

How to

samtools view C1_R1.mem.bam| cut -f 6 | grep 'N' | wc -l

Exercise 6: How does a read with tophat spliced alignment look in the BWA results?

Let's how read 9 looks in BWA results. Here's how it looks in tophat results (picking out only the first 9 columns in the sam file):

9 99 2L 7954 50 75M = 8051

9 147 2L 8051 50 66M76N9M = 7954

How to

samtools view C1_R1.mem.bam| awk '{if ($1 == 9) print}'

Help! I have a lots of reads and a large number of reads. Make tophat go faster!

Use threading option efficiently (tophat -p <number of threads>)
Split one data file into smaller chunks and run multiple instances of tophat. Finally concatenate the output.
- WAIT! We have a pipeline for that!
- Look for fastTophat in $BI/scripts
Split mapping by chromosome- mapping to each chromosome=1 job.

BACK TO COURSE OUTLINE