Tuxedo Suite For Splice Variant Analysis and Identifying Novel Transcripts

Objectives

In this lab, you will explore an RNA-seq analysis workflow that uses the Tuxedo pipeline to identify novel transcripts. Simulated RNA-seq data will be provided to you: 75 bp paired-end reads generated in silico to replicate real gene count data from Drosophila. The data simulate two biological groups with three biological replicates per group (6 samples total) and have already been run through the workflow. We will look at:

  1. How the workflow was run and what steps are involved.
  2. Which genes and isoforms are significantly differentially expressed, and which novel transcripts were identified.

Resources

Useful resources for the Tuxedo suite are:

  1. the original RNA-seq analysis protocol using Tuxedo, published in Nature Protocols,
  2. the Tuxedo resource bundles for selected organisms (GFF annotations, pre-built Bowtie references, etc.), and
  3. the experiment that the example data for this tutorial came from, which has the raw FASTQ data in the SRA.

Back to the Big Picture

Let's revisit that pipeline diagram here

Paths through the Tuxedo workflow

There are three major paths through this workflow:

NO NOVEL JUNCTIONS: Simple differential gene expression analysis against a set of known splice variants.

    • A GTF/GFF file is provided, and you specify that no novel junctions should be explored
    • This is by far the fastest path through the workflow. 

NOVEL JUNCTIONS:

  1. Same as the NO NOVEL JUNCTIONS path, but novel splice junctions are explored in addition to known splice junctions
    • A GTF/GFF file is provided, and you let the tool search for novel junctions also
  2. Use the input data to construct de novo splice junctions without reference to any known splice junctions
    • No GTF/GFF is provided

Step 1: What does Tophat do?

As discovered in previous sections, tophat maps your data to your reference in a transcriptome-aware manner and identifies splice junctions as it does so. We've already looked at how you can tell it to identify novel junctions. The two relevant options are:

--no-novel-juncs

Only look for reads across junctions indicated in the supplied GFF or junctions file (ignored without -G/-j).

-G/--GTF <GTF/GFF3 file>

Supply TopHat with a set of gene model annotations and/or known transcripts, as a GTF 2.2 or GFF3 formatted file.
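
The three paths through the workflow differ only in which of these options TopHat receives. As a minimal sketch, the Python helper below builds a TopHat argument list for each path. The function name, the index name dm3 and all file names are hypothetical placeholders; -o is TopHat's output directory option. The lists are only printed here, not executed.

```python
def tophat_args(index, reads_1, reads_2, out_dir, gtf=None, novel_junctions=True):
    """Build a TopHat argument list for one paired-end sample (illustrative only)."""
    args = ["tophat", "-o", out_dir]
    if gtf is not None:
        # Known splice variants are supplied with -G.
        args += ["-G", gtf]
        if not novel_junctions:
            # Restrict the search to junctions in the annotation.
            args.append("--no-novel-juncs")
    elif not novel_junctions:
        # --no-novel-juncs is ignored without -G/-j, so refuse this combination.
        raise ValueError("novel_junctions=False requires a GTF/GFF file")
    args += [index, reads_1, reads_2]
    return args


# Hypothetical file names for one sample.
common = ("dm3", "C1_R1_1.fq", "C1_R1_2.fq", "C1_R1_thout")

known_only = tophat_args(*common, gtf="genes.gtf", novel_junctions=False)
known_plus_novel = tophat_args(*common, gtf="genes.gtf")
de_novo = tophat_args(*common)

for path in (known_only, known_plus_novel, de_novo):
    print(" ".join(path))
```

The first list corresponds to the NO NOVEL JUNCTIONS path, and the other two to the two NOVEL JUNCTIONS paths.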

Step 2: What does cufflinks do, and how does it do it?

For each sample, cufflinks assembles aligned reads into transcripts and calculates their abundance. 

The new RABT feature

Cufflinks uses RABT (reference annotation based transcript assembly) to let existing annotation guide the assembly of transcripts, so that known transcripts and novel ones are reported together.
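
As a rough sketch of how RABT is selected on the command line: Cufflinks takes the annotation with -g/--GTF-guide for reference-guided assembly, with -G/--GTF to quantify only the annotated transcripts, or with neither for a de novo assembly. The Python helper and the file names below are hypothetical; accepted_hits.bam is the alignment file TopHat writes to its output directory.

```python
def cufflinks_args(bam, out_dir, gtf=None, mode="rabt"):
    """Build a Cufflinks argument list for one sample (illustrative only).

    mode="rabt"      -> -g: annotation guides assembly, novel transcripts allowed
    mode="reference" -> -G: quantify annotated transcripts only
    gtf=None         -> de novo assembly, no annotation used
    """
    args = ["cufflinks", "-o", out_dir]
    if gtf is not None:
        if mode == "rabt":
            args += ["-g", gtf]
        elif mode == "reference":
            args += ["-G", gtf]
        else:
            raise ValueError(f"unknown mode: {mode!r}")
    args.append(bam)
    return args


# Hypothetical paths for one sample.
bam = "C1_R1_thout/accepted_hits.bam"
print(" ".join(cufflinks_args(bam, "C1_R1_clout", gtf="genes.gtf")))
print(" ".join(cufflinks_args(bam, "C1_R1_clout", gtf="genes.gtf", mode="reference")))
print(" ".join(cufflinks_args(bam, "C1_R1_clout")))
```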

Step 3: What does cuffmerge do, and how does it do it?

For each separate dataset representing a specific replicate and condition, cufflinks assembles a map of genomic areas enriched in aligned reads. Cuffmerge then takes the set of individual assemblies and merges them into a consensus assembly for all the provided datasets. The consensus may include known annotations if you have provided those to the program.
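
Cuffmerge reads the individual assemblies from a plain text file that lists one GTF path per line. The sketch below writes such a list for the six samples and builds the matching argument list. The directory names, genes.gtf and both helper functions are hypothetical; transcripts.gtf is the assembly file Cufflinks writes, and -g passes the optional reference annotation.

```python
import tempfile
from pathlib import Path


def write_assembly_list(sample_dirs, list_path):
    """Write one transcripts.gtf path per line and return the paths written."""
    gtf_paths = [str(Path(d) / "transcripts.gtf") for d in sample_dirs]
    Path(list_path).write_text("\n".join(gtf_paths) + "\n")
    return gtf_paths


def cuffmerge_args(list_path, out_dir, ref_gtf=None):
    """Build a cuffmerge argument list (illustrative only)."""
    args = ["cuffmerge", "-o", out_dir]
    if ref_gtf is not None:
        # Include known annotations in the consensus assembly.
        args += ["-g", ref_gtf]
    args.append(str(list_path))
    return args


# Two groups x three replicates, with hypothetical Cufflinks output directories.
sample_dirs = [f"C{group}_R{rep}_clout" for group in (1, 2) for rep in (1, 2, 3)]

with tempfile.TemporaryDirectory() as tmp:
    list_path = Path(tmp) / "assemblies.txt"
    paths = write_assembly_list(sample_dirs, list_path)
    print(list_path.read_text(), end="")

print(" ".join(cuffmerge_args("assemblies.txt", "merged_asm", ref_gtf="genes.gtf")))
```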

Step 4: What does cuffcompare do? (Optional)

Cuffcompare allows you to compare your assembled transcripts to existing annotation.
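
One way to summarise such a comparison is to tally the class codes attached to each transcript, for example "=" for a match to a known transcript, "j" for a potentially novel isoform that shares a junction with a known one, and "u" for an unknown intergenic transcript. The sketch below assumes a GTF whose attribute column carries transcript_id and class_code entries; the helper is hypothetical and the three example transcripts are made up for illustration, not taken from the lab data.

```python
import re
from collections import Counter

TRANSCRIPT_ID = re.compile(r'transcript_id "([^"]+)"')
CLASS_CODE = re.compile(r'class_code "([^"]+)"')


def count_class_codes(gtf_lines):
    """Count transcripts per class code, counting each transcript_id once."""
    code_by_transcript = {}
    for line in gtf_lines:
        fields = line.rstrip("\n").split("\t")
        if line.startswith("#") or len(fields) < 9:
            continue  # skip comments and malformed lines
        tid = TRANSCRIPT_ID.search(fields[8])
        code = CLASS_CODE.search(fields[8])
        if tid and code:
            code_by_transcript[tid.group(1)] = code.group(1)
    return Counter(code_by_transcript.values())


# Made-up records: one known transcript (two exons), one novel isoform, one intergenic.
example = [
    'chr2L\tCufflinks\texon\t100\t200\t.\t+\t.\tgene_id "XLOC_1"; transcript_id "TCONS_1"; class_code "=";',
    'chr2L\tCufflinks\texon\t300\t400\t.\t+\t.\tgene_id "XLOC_1"; transcript_id "TCONS_1"; class_code "=";',
    'chr2L\tCufflinks\texon\t100\t250\t.\t+\t.\tgene_id "XLOC_1"; transcript_id "TCONS_2"; class_code "j";',
    'chr3R\tCufflinks\texon\t900\t990\t.\t-\t.\tgene_id "XLOC_2"; transcript_id "TCONS_3"; class_code "u";',
]

counts = count_class_codes(example)
print(dict(counts))
```

Counting by transcript rather than by line matters because each exon of a transcript appears on its own GTF line.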

Step 5: What does cuffdiff do?

Next, cuffdiff uses the consensus splice variant annotations (and/or the known splice variants) to quantify expression levels of genes and isoforms as FPKM (fragments per kilobase of transcript per million mapped fragments), and tests for differences between the groups.
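
To make the FPKM unit concrete, here is the textbook arithmetic as a short Python sketch. This is the plain definition only: cuffdiff itself estimates abundances with a statistical model, so its reported values will not be reproduced exactly by this function, and the numbers in the example are invented.

```python
def fpkm(fragments, transcript_length_bp, total_mapped_fragments):
    """Fragments per kilobase of transcript per million mapped fragments."""
    if transcript_length_bp <= 0 or total_mapped_fragments <= 0:
        raise ValueError("length and library size must be positive")
    kilobases = transcript_length_bp / 1_000
    millions = total_mapped_fragments / 1_000_000
    return fragments / (kilobases * millions)


# Invented example: 500 fragments on a 2,000 bp transcript in a library
# of 10 million mapped fragments.
value = fpkm(500, 2_000, 10_000_000)
print(value)  # 25.0
```

Dividing by transcript length lets long and short transcripts be compared, and dividing by library size lets samples sequenced to different depths be compared.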

 

Let's look at the commands to perform these steps and how the output files look...