Get your data
Six raw data files were provided as the starting point:
...
Due to the size of the data and length of run time, many of the programs have already been run for this exercise. We will submit some jobs, but because we cannot wait for it to complete, we will be looking through already generated results. You will then be parsing the output, finding answers, and visualizing results (in the directory tophat_results).
Run tophat
On lonestar, to run tophat, following modules need to be loaded.
| Code Block | ||
|---|---|---|
| ||
module load boost/1.45.0 module load bowtie module load tophat |
Create reference index - this has already been done for you.
bowtie2-build reference/genome.fa reference/genome.fa
Notice the index files in the reference/ directory.
Tophat command syntax
tophat [options] <bowtie_index_prefix> <reads1> <reads2>
...
| Code Block | ||
|---|---|---|
| ||
qsub tophat_launcher.sge |
Examine tophat parameters
Why did we choose the tophat parameters we did and what do they mean? Here's our tophat command again:
...
As you can see there are many many other options for running tophat!
Examine tophat results
We are going to be running commands that can take a couple of minutes to run. But with 20 people running the same commands, we don't want to run these on the head node. They are too small to be submitted too. So, let's switch over to the tab with our idev session.
...
Let's parse the bam file to pull out just spliced alignments.
...
| Code Block | ||||
|---|---|---|---|---|
| at the CIGAR string
| |||
samtools view accepted_hits.bam | cut -f 1,6 | grep 'N'|headThe |
The CIGAR string "66M76N9M" represents a spliced sequence. The codes mean:
- 66M - the first 66 bases match the reference
- 76N - there are then 76 bases on the reference with no corresponding bases in the sequence (an intron)
- 9M - the last 17 bases match the reference
Exercise 2b: What is the read ID for one of these reads with spliced alignment?
...
| Expand | ||||
|---|---|---|---|---|
| ||||
|
Exercise 4: Samtools flagstat
Lets run samtools flagstat on our tophat bam file:
| Code Block |
|---|
samtools flagstat accepted_hits.bam
|
Let's see how this compares to BWA results...
| Code Block | ||
|---|---|---|
| ||
cd $SCRATCH/my_rnaseq_course/bwa_mem_results |
...
Exercise 45: Count spliced sequences in BWA results
How many spliced sequences are there in one of the alignment files, C1_R1.mem.bam?
...
| Expand | ||||
|---|---|---|---|---|
| ||||
|
Exercise 56: How does a read with tophat spliced alignment look in the BWA results?
...
| Expand | ||||
|---|---|---|---|---|
| ||||
|
Help! I have a lots of reads and a large number of reads. Make tophat go faster!
Use threading option efficiently (tophat -p <number of threads>)
- Split one data file into smaller chunks and run multiple instances of tophat. Finally concatenate the output.
- WAIT! We have a pipeline for that!
- Look for fastTophat in $BI/scripts
- Split mapping by chromosome- mapping to each chromosome=1 job.
BACK TO COURSE OUTLINE