Differential expression with splice variant analysis aug2012
Differential expression with splice variant analysis at the same time: the Tuxedo pipeline
The Tuxedo Pipeline is a suite of tools for RNA-seq analysis, also known as the Tophat/Cufflinks workflow. It can be run in a variety of ways, optionally including de novo splice variant discovery. If an adequate set of splice variants is also available, it can also be run without splice variant detection to perform simple differential gene expression.
Resources
Useful RNA-seq resources are summarized on our Resources tool list, Transcriptome analaysis section. The most important of these resources for Tuxedo are:
- the original RNAseq analysis protocol using Tuxedo article in Nature Protocols, and
- the URL for Tuxedo resource bundles for selected organisms (gff annotations, pre-built bowtie references, etc.)
- the example data we'll use for this tutorial came from this experiment which has the raw fastq data in the SRA.
Objectives
In this lab, you will explore a fairly typical RNA-seq analysis workflow using the Tuxedo pipeline. Simulated RNA-seq data will be provided to you; the data contains 75 bp paired-end reads that have been generated in silico to replicate real gene count data from Drosophila. The data simulates two biological groups with three biological replicates per group (6 samples total). This simulated data has already been run through a basic RNA-seq analysis workflow. We will look at:
- How the workflow was run and what steps are involved.
- What genes and isoforms are significantly differentially expressed
Introduction
Overall Workflow Diagram
This overview of tophat cufflinks workflow Diagram outlines the Tuxedo pipeline. We have annotated the image from the original paper to include the important file types at each stage, and to note the steps skippin in the "fast path" (no de novo junction assembly).
This is the full workflow that includes de novo splice variant detection. For simple differential gene expression, Steps 2 (cufflinks) and 3-4 (cuffmerge) can be omitted.
Tuxedo data requirements
What is required for this pipeline?
- One or more datasets (best with at least two biological replicates and at least two conditions) in fastq format
- A reference genome, indexed for the Bowtie or Bowtie2 aligner (see this Tophat Resource Bundles page)
- Optionally, a set of known splice variants in the form of a GTF (gene transfer format) or GFF (gene feature format) file. These are also packaged as part of the resource bundles.
Paths through the Tuxedo workflow
There are three major paths through this workflow:
- Simple differential gene expression analysis against a set of known splice variants.
- A GTF/GFF file is provided, and you specify that no novel junctions should be explored
- This is by far the fastest path through the workflow.
- Same as 1), but novel splice junctions should be explored in addition to known splice junctions
- A GTF/GFF file is provided, and you let the tool search for novel junctions also
- Use the input data to construct de novo splice junctions without reference to any known splice junctions
- No GTF/GFF is provided
What tophat does
The 1st, Tophat step is always required and sets the stage for all that follows. Tophat does a transcriptome-aware alignment of the input sequences to a reference genome using either the Bowtie or Bowtie2 aligner (in theory it can use other aligners, but we do not recommend this).
Split Read Alignment (Splice Finding)
To do this, Tophat goes through several steps:
- The input sequences are aligned to the transcriptome for your reference genome, if you provided a GTF/GFF file. Remember, the transcriptome annotation describes known full-length unspliced transcripts.
- sequences that align to the transcriptome are retained, and their coordinates are translated to genomic coordinates
- sequences that do not align to the transcriptome are subjected to further analysis below
- Remaining sequences are broken into sub-fragments of at least 25 bases, and these sub-fragments are aligned to the reference genome.
- if two adjacent sub-fragments align to non-adjacent genomic locations, they are "trans frags" that will be used to infer splice junctions
At the end of the Tophat process, you have a BAM file describing the alignment of the input data to genomic coordinates. This file can be used as input for downstream applications like Cuffmerge, which will assemble parsimonious consensus fragments from the BAM file coordinates.
Welcome to the University Wiki Service! Please use your IID (yourEID@eid.utexas.edu) when prompted for your email address during login or click here to enter your EID. If you are experiencing any issues loading content on pages, please try these steps to clear your browser cache. If you require further assistance, please email wikihelp@utexas.edu.