Differential expression with splice variant analysis aug2012

Differential expression with splice variant analysis at the same time: the Tuxedo pipeline

The Tuxedo Pipeline, also known as the Tophat/Cufflinks workflow, is a suite of tools for RNA-seq analysis. It can be run in a variety of ways, optionally including de novo splice variant discovery. If an adequate set of known splice variants is already available, it can instead be run without splice variant detection to perform simple differential gene expression.

Resources

Useful RNA-seq resources are summarized on our Resources tool list, in the Transcriptome analysis section. The most important of these resources for Tuxedo are:

  1. the original Nature Protocols article describing RNA-seq analysis with Tuxedo,

  2. the URL for Tuxedo resource bundles for selected organisms (GFF annotations, pre-built Bowtie references, etc.), and

  3. the example data we'll use for this tutorial, which came from this experiment; the raw fastq data is available in the SRA.

Objectives

In this lab, you will explore a fairly typical RNA-seq analysis workflow using the Tuxedo pipeline. Simulated RNA-seq data will be provided to you; the data contains 75 bp paired-end reads that have been generated in silico to replicate real gene count data from Drosophila. The data simulates two biological groups with three biological replicates per group (6 samples total). This simulated data has already been run through a basic RNA-seq analysis workflow. We will look at:

  1. How the workflow was run and what steps are involved.

  2. What genes and isoforms are significantly differentially expressed.

Introduction

Overall Workflow Diagram

This overview diagram of the Tophat/Cufflinks workflow outlines the Tuxedo pipeline. We have annotated the image from the original paper to include the important file types at each stage, and to note the steps skipped in the "fast path" (no de novo junction assembly).

This is the full workflow that includes de novo splice variant detection. For simple differential gene expression, Steps 2 (cufflinks) and 3-4 (cuffmerge) can be omitted.

Tuxedo data requirements

What is required for this pipeline?

  • One or more datasets (best with at least two biological replicates and at least two conditions) in fastq format

  • A reference genome, indexed for the Bowtie or Bowtie2 aligner (see this Tophat Resource Bundles page)

  • Optionally, a set of known splice variants in the form of a GTF (gene transfer format) or GFF (gene feature format) file. These are also packaged as part of the resource bundles.
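If the resource bundles do not include a pre-built index for your organism, one can be built from a genome fasta. A minimal sketch, assuming a hypothetical Drosophila fasta named dmel.fa:

```shell
# Build a Bowtie2 index from a genome fasta (file names are hypothetical;
# the resource bundles above already contain pre-built indexes for many organisms)
bowtie2-build dmel.fa dmel_index
# Downstream, Tophat is pointed at the "dmel_index" prefix, not the fasta file
```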

Paths through the Tuxedo workflow

There are three major paths through this workflow:

  1. Simple differential gene expression analysis against a set of known splice variants.

    • A GTF/GFF file is provided, and you specify that no novel junctions should be explored

    • This is by far the fastest path through the workflow.

  2. Same as 1), but novel splice junctions should be explored in addition to known splice junctions

    • A GTF/GFF file is provided, and you let the tool search for novel junctions also

  3. Use the input data to construct de novo splice junctions without reference to any known splice junctions

    • No GTF/GFF is provided
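The three paths differ only in how Tophat is invoked. A sketch for one sample, assuming hypothetical file names (genes.gtf for the known transcripts, dmel_index for the Bowtie2 index prefix):

```shell
# Path 1: known splice variants only -- by far the fastest
tophat -G genes.gtf --no-novel-juncs -o c1_r1_path1 dmel_index c1_r1_1.fastq c1_r1_2.fastq

# Path 2: known splice variants plus novel junction discovery
tophat -G genes.gtf -o c1_r1_path2 dmel_index c1_r1_1.fastq c1_r1_2.fastq

# Path 3: purely de novo junction discovery (no GTF/GFF)
tophat -o c1_r1_path3 dmel_index c1_r1_1.fastq c1_r1_2.fastq
```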

What tophat does

The first step, Tophat, is always required and sets the stage for all that follows. Tophat performs a transcriptome-aware alignment of the input sequences to a reference genome using either the Bowtie or Bowtie2 aligner (in theory it can use other aligners, but we do not recommend this).

Split Read Alignment (Splice Finding)

To do this, Tophat goes through several steps:

  1. The input sequences are aligned to the transcriptome for your reference genome, if you provided a GTF/GFF file. Remember, the transcriptome annotation describes known full-length spliced (mature) transcripts.

    • sequences that align to the transcriptome are retained, and their coordinates are translated to genomic coordinates

    • sequences that do not align to the transcriptome are subjected to further analysis below

  2. Remaining sequences are broken into sub-fragments of at least 25 bases, and these sub-fragments are aligned to the reference genome.

    • if two adjacent sub-fragments align to non-adjacent genomic locations, they are "trans frags" that will be used to infer splice junctions

At the end of the Tophat process, you have a BAM file describing the alignment of the input data to genomic coordinates. This file can be used as input for downstream applications like Cufflinks, which will assemble parsimonious consensus transcripts from the BAM file coordinates.
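Before moving on, it is worth sanity-checking the BAM file. A sketch using samtools, assuming a hypothetical Tophat output directory named c1_r1_thout:

```shell
# Summarize how many reads aligned (total, properly paired, etc.)
samtools flagstat c1_r1_thout/accepted_hits.bam

# Create the .bai index that IGV needs to display the BAM
samtools index c1_r1_thout/accepted_hits.bam
```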

What cufflinks and cuffmerge do

Cuffmerge Assembly

For each separate dataset representing a specific replicate and condition, cufflinks assembles a map of genomic areas enriched in aligned reads. cuffmerge then takes the set of individual assemblies and merges them into a consensus assembly for all the provided datasets. The consensus may include known splice variant annotations if you have provided those to the program.
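The two steps above can be sketched as shell commands, assuming the hypothetical per-sample Tophat output directories and file names used earlier:

```shell
# One cufflinks assembly per replicate/condition dataset
for SAMPLE in c1_r1 c1_r2 c1_r3 c2_r1 c2_r2 c2_r3; do
    cufflinks -o "${SAMPLE}_clout" "${SAMPLE}_thout/accepted_hits.bam"
done

# cuffmerge takes a text file listing the per-sample assemblies;
# -g folds known annotations into the consensus, -s supplies the genome fasta
ls -1 */transcripts.gtf > assemblies.txt
cuffmerge -g genes.gtf -s dmel.fa assemblies.txt   # writes merged_asm/merged.gtf
```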

What cuffdiff and cummeRbund do

Next, cuffdiff uses the consensus splice variant annotations (and/or the known splice variants) to quantify expression levels of genes and isoforms, using the FPKM (fragments per kilobase of transcript per million mapped fragments) metric.
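A sketch of a cuffdiff invocation for our six samples, with replicates comma-separated within each condition (all file and label names are assumptions); the trailing awk line just illustrates the FPKM arithmetic on made-up numbers:

```shell
# Quantify genes/isoforms from the merged annotation; one BAM per replicate,
# replicates within a condition separated by commas
cuffdiff -o diff_out -L C1,C2 merged_asm/merged.gtf \
    c1_r1_thout/accepted_hits.bam,c1_r2_thout/accepted_hits.bam,c1_r3_thout/accepted_hits.bam \
    c2_r1_thout/accepted_hits.bam,c2_r2_thout/accepted_hits.bam,c2_r3_thout/accepted_hits.bam

# FPKM arithmetic on illustrative numbers: 500 fragments on a 2 kb
# transcript, out of 10 million mapped fragments total
awk 'BEGIN { print 500 / (2 * 10) }'   # 500 / (2 kb * 10 M) = 25 FPKM
```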

Finally, cummeRbund creates pretty differential expression plots of the FPKM data using R.
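cummeRbund is an R package, but it can be driven from the shell via Rscript. A minimal sketch, assuming the hypothetical diff_out directory from the cuffdiff step:

```shell
# Load the cuffdiff output and write one example plot to a PDF
Rscript -e '
library(cummeRbund)
cuff <- readCufflinks("diff_out")          # parses the cuffdiff output directory
pdf("density.pdf")
print(csDensity(genes(cuff)))              # gene-level FPKM density plot
dev.off()
'
```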

Notes on FASTQ preparation

Although we won't cover these issues here, there are some issues you should consider before embarking on the Tuxedo pipeline:

  1. Should my FASTQ sequences be trimmed to remove low-quality 3' bases?

  2. Should I remove adapter sequences before running Tophat?

  3. Should I attempt to remove sequences that map to undesired RNAs before running Tophat? (rRNA for example)

  4. How would, for example, rRNA sequence removal be done?

  5. What other pre-processing steps might I consider?
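As one illustration of how such pre-processing might be done (the tools, adapter sequence, and file names here are assumptions, not part of this exercise):

```shell
# Trim low-quality 3' bases (-q) and a standard Illumina adapter (-a)
cutadapt -q 20 -a AGATCGGAAGAGC -o c1_r1_1.trimmed.fastq c1_r1_1.fastq

# Remove read pairs that align to an rRNA index; --un-conc writes out
# only the pairs that did NOT align concordantly ('%' becomes 1 or 2)
bowtie2 -x rRNA_index -1 c1_r1_1.fastq -2 c1_r1_2.fastq \
    --un-conc c1_r1_norRNA_%.fastq -S /dev/null
```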

Some Logistics...

Six raw data files were provided as the starting point:

  • c1_r1, c1_r2, c1_r3 from the first biological condition

  • c2_r1, c2_r2, and c2_r3 from the second biological condition

Due to the size of the data and length of run time, most of the programs have already been run for this exercise. The commands run are in different *.commands files. We will spend some time looking through these commands to understand them. You will then be parsing the output, finding answers, and visualizing results.

Data for this section is all located under $BI/ngs_course/tophat_cufflinks/, so cd into this directory now.

Commands used are in different *.commands files located in $BI/ngs_course/tophat_cufflinks/run_commands

Some output, like BAM files, can be retrieved from the URL http://loving.corral.tacc.utexas.edu/bioiteam/tophat_cufflinks/ and loaded directly into IGV using Load from URL.

If you generate your own output and would like to view it in IGV, you will need to scp the files from Lonestar to your computer.

On your computer's side:

Go to the directory where you want to copy files to.

scp my_user_name@lonestar.tacc.utexas.edu:/home/.../stuff.fastq ./

Replace the "/home/..." with the "pwd" information obtained earlier.

This command would transfer "stuff.fastq" from the specified directory on Lonestar to your current directory on your computer.

For help, run each command without arguments and hit enter to see the usage information it offers. You will also almost certainly need to consult the documentation for tophat, cufflinks, and cummeRbund.

Exercise Workflow