Six raw data files were provided as the starting point:
cds cd my_rnaseq_course cp -r /corral-repl/utexas/BioITeam/rnaseq_course/tophat_exercise . & cp -r /corral-repl/utexas/BioITeam/rnaseq_course/tophat_results . & #pregenerated tophat results #Lets get pregenerated bwa results as well for doing comparisons cp -r /corral-repl/utexas/BioITeam/rnaseq_course/bwa_mem_results . & cd tophat_exercise |
Due to the size of the data and length of run time, many of the programs have already been run for this exercise. We will submit some jobs, but because we cannot wait for it to complete, we will be looking through already generated results. You will then be parsing the output, finding answers, and visualizing results (in the directory tophat_results).
On lonestar, to run tophat, following modules need to be loaded.
module load boost/1.45.0 module load bowtie module load tophat |
Create reference index - this has already been done for you.
bowtie2-build reference/genome.fa reference/genome.fa
Notice the index files in the reference/ directory.
Tophat command syntax
tophat [options] <bowtie_index_prefix> <reads1> <reads2>
Create a tophat.commands file with all your tophat commands.
nano commands.tophat tophat -p 2 -G reference/genes.gtf -o C1_R1_thout reference/genome.fa data/GSM794483_C1_R1_1.fq data/GSM794483_C1_R1_2.fq tophat -p 2 -G reference/genes.gtf -o C1_R2_thout reference/genome.fa data/GSM794484_C1_R2_1.fq data/GSM794484_C1_R2_2.fq tophat -p 2 -G reference/genes.gtf -o C1_R3_thout reference/genome.fa data/GSM794485_C1_R3_1.fq data/GSM794485_C1_R3_2.fq tophat -p 2 -G reference/genes.gtf -o C2_R1_thout reference/genome.fa data/GSM794486_C2_R1_1.fq data/GSM794486_C2_R1_2.fq tophat -p 2 -G reference/genes.gtf -o C2_R2_thout reference/genome.fa data/GSM794487_C2_R2_1.fq data/GSM794487_C2_R2_2.fq tophat -p 2 -G reference/genes.gtf -o C2_R3_thout reference/genome.fa data/GSM794488_C2_R3_1.fq data/GSM794488_C2_R3_2.fq |
Create a launcher file called tophat_launcher.sge to submit these tophat commands.
module load python launcher_creator.py -j commands.tophat -n tophat -q normal -t 08:00:00 -a CCBB -l tophat_launcher.sge |
Submit your tophat job
qsub tophat_launcher.sge |
Why did we choose the tophat parameters we did and what do they mean? Here's our tophat command again:
tophat -p 2 -G reference/genes.gtf -o C1_R1_thout reference/genome.fa data/GSM794483_C1_R1_1.fq data/GSM794483_C1_R1_2.fq
Exercise 1a: Why did we use the -G flag?
Remember that you can type the command and hit enter to get help info about it. |
The -G <GTF filename> to use the annotated splice junctions in the supplied GTF file |
Exercise 1b: Using only annotated junctions
How would I tell tophat to only use a specified set of transcript annotation and not assemble any novel transcripts?
Specify -G <gtf filename> to have tophat use the annotated transcripts in the supplied GTF file |
Exercise 1c: Why did we use the -p flag? And why did we choose -p 2 specifically?
As you can see there are many many other options for running tophat!
We are going to be running commands that can take a couple of minutes to run. But with 20 people running the same commands, we don't want to run these on the head node. They are too small to be submitted too. So, let's switch over to the tab with our idev session.
When following along here, please switch to your idev session for running these example commands. |
cd ../tophat_results |
cd C1_R1_thout ls -l -rw-rw---- 1 daras G-801020 331M May 16 23:35 accepted_hits.bam -rw------- 1 daras G-801020 563 May 16 23:35 align_summary.txt -rw------- 1 daras G-801020 52 May 16 23:35 deletions.bed -rw------- 1 daras G-801020 54 May 16 23:35 insertions.bed -rw------- 1 daras G-801020 2.9M May 16 23:35 junctions.bed drwx------ 2 daras G-801020 32K May 16 23:35 logs -rw------- 1 daras G-801020 184 May 16 23:35 prep_reads.info -rw------- 1 daras G-801020 442 May 16 23:35 unmapped.bam |
Exercise 1: Examine a BAM file
Examine a few lines of the C1_R1 alignment file.
head accepted_hits.bam |
samtools view -x accepted_hits.bam | head #make sure you are in tophat_results/C1_R1_thout to run this |
Exercise 2a: Spliced sequences - Find a spliced alignment.
How is a spliced sequence represented in the BAM file?
![]()
Which field in our file contains the cigar score?
samtools view -x accepted_hits.bam | head |
Let's parse the bam file to pull out just spliced alignments.
samtools view accepted_hits.bam | cut -f 1,6 | grep 'N'|head |
The CIGAR string "66M76N9M" represents a spliced sequence. The codes mean:
Exercise 2b: What is the read ID for one of these reads with spliced alignment?
Pick your favorite and remember it. I have picked read 9 with the cigar score 66M76N9M.
Exercise 3: Count spliced sequences
How many spliced sequences are there in the C1_R1 alignment file?
samtools view and cut and grep |
|
Exercise 4: Samtools flagstat
Lets run samtools flagstat on our tophat bam file:
samtools flagstat accepted_hits.bam |
cd $SCRATCH/my_rnaseq_course/bwa_mem_results |
Exercise 5: Count spliced sequences in BWA results
How many spliced sequences are there in one of the alignment files, C1_R1.mem.bam?
samtools view and cut and grep |
|
Exercise 6: How does a read with tophat spliced alignment look in the BWA results?
Let's how read 9 looks in BWA results. Here's how it looks in tophat results (picking out only the first 9 columns in the sam file):
9 99 2L 7954 50 75M = 8051
9 147 2L 8051 50 66M76N9M = 7954
|
Use threading option efficiently (tophat -p <number of threads>)
BACK TO COURSE OUTLINE