Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

In this lab, you will explore a fairly typical RNA-seq analysis workflow using the Tuxedo pipeline to identify novel transcripts. Simulated RNA-seq data will be provided to you; the data contains 75 bp paired-end reads that have been generated in silico to replicate real gene count data from Drosophila. The data simulates two biological groups with three biological replicates per group (6 samples total). This simulated data has already been run through a basic RNA-seq analysis the workflow. We will look at:

  1. How the workflow was run and what steps are involved.
  2. What genes and isoforms are significantly differentially expressed and what novel transcripts were identified.

Resources

Useful resources for Tuxedo suite are:

  1. the original RNAseq analysis protocol using Tuxedo article in Nature Protocols, and
  2. the URL for Tuxedo resource bundles for selected organisms (gff annotations, pre-built bowtie references, etc.)
  3. the example data we'll use for this tutorial came from this experiment which has the raw fastq data in the SRA.

...

Back to the Big Picture

Let's revisit that pipeline diagram here

Image Removed

You can see see that tuxedo suite can be used in two main ways:Image Added

Paths through the Tuxedo workflow

There are three major paths through this workflow:

NO NOVEL JUNCTIONSSimple differential gene expression analysis against a set of known splice variants.

    • A GTF/GFF file is provided, and you specify that no novel junctions should be explored
    • This is by far the fastest path through the workflow. 

NOVEL JUNCTIONS:

  1. Same as 1), but novel splice junctions should be explored in addition to known splice junctions
    • A GTF/GFF file is provided, and you let the tool search for novel junctions also
  2. Use the input data to construct de novo splice junctions without reference to any known splice junctions
    • No GTF/GFF is provided

Step I: What does

...

Tophat do?

...

As discovered in previous sections, tophat maps your data to your reference in a transcriptome-aware manner, that will also identify junctions.  We've already looked at how you can tell it to identify novel junctions.

Image Removed

 

tophat recap: options for novelty

 

--no-novel-juncsOnly look for reads across junctions indicated in the supplied GFF or junctions file. (ignored without -G/-j)
-G/--GTF <GTF/GFF3 file>

Supply TopHat with a set of gene model annotations and/or known transcripts, as a GTF 2.2 or GFF3 formatted file.

Step 2: What does cufflinks do? and how does it do it?

For each sample, cufflinks assembles aligned reads into transcripts and calculates their abundance. 

The new RABT feature

Cuuflinks uses RABT (Reference annotation based transcript assembly) as a method to use existing annotation to guide the assembly of transcripts.  

Image Added

Step 3: What does cuffmerge do? and how does it do it?

For each separate dataset representing a specific replicate and condition, cufflinks assembles a map of genomic areas enriched in aligned reads.  cuffmerge Cuffmerge then takes the set of individual assemblies and merges them into a consensus assembly for all the provided datasets. The consensus may include known splice variant annotations if you have provided those to the program.

assembly figure

rabt figure

Image Removed

 

OUTput files

CUFFMERGE CUFFCOMPARE

.

...

Image Added

Step 4: What does cuffcompare do?

Step 5: What does cuffdiff do?

Next, cuffdiff uses the consensus splice variant annotations (and/or the known splice variants) to quantify expression levels of genes and isoforms, using FPKM (fragments per kilobase per million reads) metrics.

Image Removed

Code Block
titleGeneral syntax for cufflinks command
cufflinks [options] <hits.bam>

Look at $BI/ngs_course/tophat_cufflinks/run_commands/cufflinks.commands to see how it was run.

...

cat $BI/ngs_course/tophat_cufflinks/run_commands/cufflinks.commands

Take a minute to look at the output files produced by one cufflinks run.
The important file is transcripts.gtf, which contains Tophat's assembled junctions for C1_R1.

Code Block
titleCufflinks output files
cd $BI/ngs_course/tophat_cufflinks/C1_R1_clout
ls -l


drwxrwxr-x 2 nsabell G-801021    32768 May 22 15:10 cuffcmp
-rwxr-xr-x 1 daras G-803889  14M Aug 16 12:49 transcripts.gtf
-rwxr-xr-x 1 daras G-803889 597K Aug 16 12:49 genes.fpkm_tracking
-rwxr-xr-x 1 daras G-803889 960K Aug 16 12:49 isoforms.fpkm_tracking
-rwxr-xr-x 1 daras G-803889    0 Aug 16 12:33 skipped.gtf

Step 3: Merging assemblies using cuffmerge

Create a file listing the paths of all per-sample transcripts.gtf files so far, then pass that to cuffmerge:

Code Block
cd $BI/ngs_course/tophat_cufflinks
find . -name transcripts.gtf > assembly_list.txt
cuffmerge <assembly_list.txt>

...

Code Block
cat $BI/ngs_course/tophat_cufflinks/assembly_list.txt

Take a minute to look at the output files produced by cuffmerge. The most important file is merged.gif, which contains the consensus transcriptome annotations cuffmerge has calculated.

Code Block
titlecuffmerge output
cd $BI/ngs_course/tophat_cufflinks/merged_asm
ls -l

-rwxrwxr-x  1 daras G-803889  1571816 Aug 16  2012 genes.fpkm_tracking
-rwxrwxr-x  1 daras G-803889  2281319 Aug 16  2012 isoforms.fpkm_tracking
drwxrwxr-x  2 daras G-803889    32768 Aug 16  2012 logs
-r-xrwxr-x  1 daras G-803889 32090408 Aug 16  2012 merged.gtf
-rwxrwxr-x  1 daras G-803889        0 Aug 16  2012 skipped.gtf
drwxrwxr-x  2 daras G-803889    32768 Aug 16  2012 tmp
-rwxrwxr-x  1 daras G-803889 34844830 Aug 16  2012 transcripts.gtf

Step 4: Finding differentially expressed genes and isoforms using cuffdiff

Code Block
cuffdiff [options] <merged.gtf> <sample1_rep1.bam,sample1_rep2.bam> <sample2_rep1.bam,sample2_rep2.bam>

Exercise: What does cuffdiff -b do?

...

-b is for enabling fragment bias correction.

 

 

  

Image Added

Let's look at the commands to perform these steps and how the output files look...