Mapping with Tophat Exercises 2014
Get your data
Six raw data files were provided as the starting point:
c1_r1, c1_r2, c1_r3 from the first biological condition
c2_r1, c2_r2, and c2_r3 from the second biological condition
Get set up for the exercises
cds
cd my_rnaseq_course
cp -r /corral-repl/utexas/BioITeam/rnaseq_course/tophat_exercise . &
cp -r /corral-repl/utexas/BioITeam/rnaseq_course/tophat_results . & #pregenerated tophat results
#Lets get pregenerated bwa results as well for doing comparisons
cp -r /corral-repl/utexas/BioITeam/rnaseq_course/bwa_mem_results . &
cd tophat_exercise
Due to the size of the data and length of run time, many of the programs have already been run for this exercise. We will submit some jobs, but because we cannot wait for it to complete, we will be looking through already generated results. You will then be parsing the output, finding answers, and visualizing results (in the directory tophat_results).
Run tophat
On lonestar, to run tophat, following modules need to be loaded.
Load modules
module load boost/1.45.0
module load bowtie
module load tophat
Create reference index - this has already been done for you.
bowtie2-build reference/genome.fa reference/genome.fa
Notice the index files in the reference/ directory.
Tophat command syntax
tophat [options] <bowtie_index_prefix> <reads1> <reads2>
Create a tophat.commands file with all your tophat commands.
Tophat commands
nano commands.tophat
tophat -p 2 -G reference/genes.gtf -o C1_R1_thout reference/genome.fa data/GSM794483_C1_R1_1.fq data/GSM794483_C1_R1_2.fq
tophat -p 2 -G reference/genes.gtf -o C1_R2_thout reference/genome.fa data/GSM794484_C1_R2_1.fq data/GSM794484_C1_R2_2.fq
tophat -p 2 -G reference/genes.gtf -o C1_R3_thout reference/genome.fa data/GSM794485_C1_R3_1.fq data/GSM794485_C1_R3_2.fq
tophat -p 2 -G reference/genes.gtf -o C2_R1_thout reference/genome.fa data/GSM794486_C2_R1_1.fq data/GSM794486_C2_R1_2.fq
tophat -p 2 -G reference/genes.gtf -o C2_R2_thout reference/genome.fa data/GSM794487_C2_R2_1.fq data/GSM794487_C2_R2_2.fq
tophat -p 2 -G reference/genes.gtf -o C2_R3_thout reference/genome.fa data/GSM794488_C2_R3_1.fq data/GSM794488_C2_R3_2.fqCreate a launcher file called tophat_launcher.sge to submit these tophat commands.
Create a launcher file called tophat_launcher.sge
module load python
launcher_creator.py -j commands.tophat -n tophat -q normal -t 08:00:00 -a CCBB -l tophat_launcher.sgeSubmit your tophat job
Create a launcher file called tophat_launcher.sge
qsub tophat_launcher.sgeExamine tophat parameters
Why did we choose the tophat parameters we did and what do they mean? Here's our tophat command again:
tophat -p 2 -G reference/genes.gtf -o C1_R1_thout reference/genome.fa data/GSM794483_C1_R1_1.fq data/GSM794483_C1_R1_2.fq
Exercise 1a: Why did we use the -G flag?
Exercise 1b: Using only annotated junctions
How would I tell tophat to only use a specified set of transcript annotation and not assemble any novel transcripts?
Exercise 1c: Why did we use the -p flag? And why did we choose -p 2 specifically?
As you can see there are many many other options for running tophat!
Examine tophat results
We are going to be running commands that can take a couple of minutes to run. But with 20 people running the same commands, we don't want to run these on the head node. They are too small to be submitted too. So, let's switch over to the tab with our idev session.
When following along here, please switch to your idev session for running these example commands.
Get tophat results
cd ../tophat_results
Take a minute to look at the output files produced by one tophat run.
cd C1_R1_thout
ls -l
-rw-rw---- 1 daras G-801020 331M May 16 23:35 accepted_hits.bam
-rw------- 1 daras G-801020 563 May 16 23:35 align_summary.txt
-rw------- 1 daras G-801020 52 May 16 23:35 deletions.bed
-rw------- 1 daras G-801020 54 May 16 23:35 insertions.bed
-rw------- 1 daras G-801020 2.9M May 16 23:35 junctions.bed
drwx------ 2 daras G-801020 32K May 16 23:35 logs
-rw------- 1 daras G-801020 184 May 16 23:35 prep_reads.info
-rw------- 1 daras G-801020 442 May 16 23:35 unmapped.bam
Exercise 1: Examine a BAM file
Examine a few lines of the C1_R1 alignment file.
Lets try this first
head accepted_hits.bamLet's try this next
samtools view -x accepted_hits.bam | head
#make sure you are in tophat_results/C1_R1_thout to run thisExercise 2a: Spliced sequences - Find a spliced alignment.
How is a spliced sequence represented in the BAM file?
Which field in our file contains the cigar score?
Finding Cigar score field
samtools view -x accepted_hits.bam | head
Let's parse the bam file to pull out just spliced alignments.
Looking for spliced alignments
samtools view accepted_hits.bam | cut -f 1,6 | grep 'N'|headThe CIGAR string "66M76N9M" represents a spliced sequence. The codes mean:
66M - the first 66 bases match the reference
76N - there are then 76 bases on the reference with no corresponding bases in the sequence (an intron)
9M - the last 17 bases match the reference
Exercise 2b: What is the read ID for one of these reads with spliced alignment?
Pick your favorite and remember it. I have picked read 9 with the cigar score 66M76N9M.
Exercise 3: Count spliced sequences
How many spliced sequences are there in the C1_R1 alignment file?
Exercise 4: Samtools flagstat
Lets run samtools flagstat on our tophat bam file:
samtools flagstat accepted_hits.bam
Let's see how this compares to BWA results...
Back to the directory with BWA results
cd $SCRATCH/my_rnaseq_course/bwa_mem_resultsExercise 5: Count spliced sequences in BWA results
How many spliced sequences are there in one of the alignment files, C1_R1.mem.bam?
Exercise 6: How does a read with tophat spliced alignment look in the BWA results?
Let's how read 9 looks in BWA results. Here's how it looks in tophat results (picking out only the first 9 columns in the sam file):
9 99 2L 7954 50 75M = 8051
9 147 2L 8051 50 66M76N9M = 7954
Help! I have a lots of reads and a large number of reads. Make tophat go faster!
Use threading option efficiently (tophat -p <number of threads>)
Split one data file into smaller chunks and run multiple instances of tophat. Finally concatenate the output.
WAIT! We have a pipeline for that!
Look for fastTophat in $BI/scripts
Split mapping by chromosome- mapping to each chromosome=1 job.
BACK TO COURSE OUTLINE