Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

Why splice aware/split alignment is important?

Split Read Alignment 

Get your data...

Six raw data files were provided as the starting point:

  • c1_r1, c1_r2, c1_r3 from the first biological condition
  • c2_r1, c2_r2, and c2_r3 from the second biological condition
  • Due to the size of the data and length of run time, most of the programs have already been run for this exercise. The commands run are in the directory run_commands. We will spend some time looking through these commands to understand them. You will then be parsing the output, finding answers, and visualizing results (in the directory results).

 

Code Block
titleGet set up for the exercises
cds
cd my_rnaseq_course
cp -r /corral-repl/utexas/BioITeam/rnaseq_course/tophat_exercise . &
cd tophat_exercise

Due to the size of the data and length of run time, most of the programs have already been run for this exercise. The commands run are in the directory run_commands. We will spend some time looking through these commands to understand them. You will then be parsing the output, finding answers, and visualizing results (in the directory results).

Run tophat

On lonestar, to run tophat, following modules need to be loaded.

Code Block
module load boost/1.45.0
module load bowtie
module load tophat
Code Block
titleGeneral syntax for tophat command
tophat [options] <bowtie_index_prefix> <reads1> <reads2>

Look at run_commands/tophat.commands to see how it was run.

...

cat run_commands/tophat.commands

tophat -p 2 -G reference/genes.gtf -o C1_R1_thout reference/genome.fa data/GSM794483_C1_R1_1.fq data/GSM794483_C1_R1_2.fq

 

Code Block
titleTake a minute to look at the output files produced by one tophat run.
cd results/C1_R1_thout
ls -l

-rw-rw---- 1 daras G-801020 331M May 16 23:35 accepted_hits.bam
-rw------- 1 daras G-801020  563 May 16 23:35 align_summary.txt
-rw------- 1 daras G-801020   52 May 16 23:35 deletions.bed
-rw------- 1 daras G-801020   54 May 16 23:35 insertions.bed
-rw------- 1 daras G-801020 2.9M May 16 23:35 junctions.bed
drwx------ 2 daras G-801020  32K May 16 23:35 logs
-rw------- 1 daras G-801020  184 May 16 23:35 prep_reads.info
-rw------- 1 daras G-801020  442 May 16 23:35 unmapped.bam

Exercise 1a: Providing a transcript annotation file

Which tophat option is used to provide a transcript annotation file (GTF file) to use?

...

Remember that you can type the command and hit enter to get help info about it.
Or Google tophat to find its online manual

...

The -G <GTF filename> to use the annotated splice junctions in the supplied GTF file

Exercise 1b: Using only annotated junctions
How would I tell tophat to only use a specified set of transcript annotation and not assemble any novel transcripts?

...

Specify -G <gtf filename> to have tophat use the annotated transcripts in the supplied GTF file
Also add the --no-novel-juncs option to suppress de novo junction detection

As you can see there are many many other options for running tophat!

Exercise 2a: Examine a BAM file

Examine a few lines of the C1_R1 alignment file.

...

samtools view

...

Code Block
samtools view -x accepted_hits.bam | less  # :q to quit

#make sure you are in results/C1_R1_thout to run this

Exercise 3b: Spliced sequences
Find a spliced alignment.

How is a spliced sequence represented in the BAM file?

...

The 6th BAM file field is the CIGAR string which tells you how your query sequence mapped to the reference.

...

 samtools view accepted_hits.bam | cut -f 1,6 | grep 'N'|head

The CIGAR string "66M76N9M" represents a spliced sequence. The codes mean:

  • 56M - the first 58 bases match the reference
  • 76N - there are then 76 bases on the reference with no corresponding bases in the sequence (an intron)
  • 17M - the last 17 bases match the reference

Exercise 4: Count spliced sequences

How many spliced sequences are there in the C1_R1 alignment file?

...

samtools view and cut and grep

...

Code Block
samtools view accepted_hits.bam | cut -f 6 | grep 'N' | wc -l

Let's see how this compares to BWA results...

Code Block
titleBack to BWA results
cds
cd $SCRATCH/my_rnaseq_course/bwa_exercise/results/bwa_mem

 

Exercise 4b: Count spliced sequences in BWA results
How many spliced sequences are there in the C1_R1 alignment file?

...

samtools view and cut and grep

...

Code Block
samtools view C1_R1.mem.bam| cut -f 6 | grep 'N' | wc -l

Exercise 5: How does a read with tophat spliced alignment look in the BWA results?

...

 

Now on to our tophat exercises.