Content Comparison

Get your data

Six raw data files were provided as the starting point:

...

Due to the size of the data and length of run time, many of the programs have already been run for this exercise. We will submit some jobs, but because we cannot wait for it to complete, we will be looking through already generated results. You will then be parsing the output, finding answers, and visualizing results (in the directory tophat_results).

Run tophat

On lonestar, to run tophat, following modules need to be loaded.

Code Block

title	Load modules

module load boost/1.45.0
module load bowtie
module load tophat

Create reference index - this has already been done for you.

bowtie2-build reference/genome.fa reference/genome.fa

Notice the index files in the reference/ directory.

Tophat command syntax

     tophat [options] <bowtie_index_prefix> <reads1> <reads2>

...

Code Block

title	Create a launcher file called tophat_launcher.sge

qsub tophat_launcher.sge

Examine tophat parameters

Why did we choose the tophat parameters we did and what do they mean? Here's our tophat command again:

...

As you can see there are many many other options for running tophat!

Examine tophat results

We are going to be running commands that can take a couple of minutes to run. But with 20 people running the same commands, we don't want to run these on the head node. They are too small to be submitted too. So, let's switch over to the tab with our idev session.

...

Let's parse the bam file to pull out just spliced alignments.

...

Looking at the CIGAR string at the CIGAR string

Code Block

title	Looking

for spliced alignments

 samtools view accepted_hits.bam | cut -f 1,6 | grep 'N'|head

The

The CIGAR string "66M76N9M" represents a spliced sequence. The codes mean:

66M - the first 66 bases match the reference
76N - there are then 76 bases on the reference with no corresponding bases in the sequence (an intron)
9M - the last 17 bases match the reference

Exercise 2b: What is the read ID for one of these reads with spliced alignment?

...

Expand

	How to
	How to

Code Block
samtools view accepted_hits.bam \| cut -f 6 \| grep 'N' \| wc -l

Exercise 4: Samtools flagstat

Lets run samtools flagstat on our tophat bam file:

Code Block
samtools flagstat accepted_hits.bam

Let's see how this compares to BWA results...

Code Block

title	Back to the directory with BWA results

cd $SCRATCH/my_rnaseq_course/bwa_mem_results

...

Exercise 45: Count spliced sequences in BWA results

How many spliced sequences are there in one of the alignment files, C1_R1.mem.bam?

...

Expand

	How to
	How to

Code Block
samtools view C1_R1.mem.bam\| cut -f 6 \| grep 'N' \| wc -l

Exercise 56: How does a read with tophat spliced alignment look in the BWA results?

...

Expand

	How to
	How to

Code Block
samtools view C1_R1.mem.bam\| awk '{if ($1 == 9) print}'

Help! I have a lots of reads and a large number of reads. Make tophat go faster!

Use threading option efficiently (tophat -p <number of threads>)
Split one data file into smaller chunks and run multiple instances of tophat. Finally concatenate the output.
- WAIT! We have a pipeline for that!
- Look for fastTophat in $BI/scripts
Split mapping by chromosome- mapping to each chromosome=1 job.

BACK TO COURSE OUTLINE

Version	Old Version 6	New Version Current
Changes made by	Dhivya Arasappan	Dhivya Arasappan
Saved on	May 19, 2014	May 21, 2015

Versions Compared

Key

Get your data

Run tophat

Examine tophat parameters

Examine tophat results

Let's see how this compares to BWA results...

Help! I have a lots of reads and a large number of reads. Make tophat go faster!