Exercise #1: RIP-Seq

Objectives

In this exercise, we will examine (but not run, since it takes too long with any file big enough to get a result worth looking at) the details of analyzing a RIP-seq experiment using Cuffdiff to identify enriched genes in the IP relative to the control (and vice versa). In this case, Ago2 was pulled-down and associated whole-mRNAs fragmented and sequenced. The entire Tuxedo pipeline was used to analyze the data, including a Tophat2 alignment coupled with differential expression analysis with Cuffdiff.

Primary Data

The data consists of a set of files located in this directory:

export BI=/corral-repl/utexas/BioITeam/
ls $BI/rnaseq_course/ripseq_exercises/

You will see two subdirectories called "RIP" and "CLIP", each with files we will look at during these exercises. Go ahead and copy the "ripseq_exercises" directory and all its contents to your scratch area and load up the cufflinks module (which contains cuffdiff):

cds
cp -r $BI/rnaseq_course/ripseq_exercises/ .
module load cufflinks/2.1.1

The cuffdiff command syntax looks like this, and has a TON of options:

cuffdiff
cuffdiff v2.1.1 (4046M)
-----------------------------
Usage:   cuffdiff [options] <transcripts.gtf> <sample1_hits.sam> <sample2_hits.sam> [... sampleN_hits.sam]
   Supply replicate SAMs as comma separated lists for each condition: sample1_rep1.sam,sample1_rep2.sam,...sample1_repM.sam

Cuffdiff

Cuffdiff supports a wide range of dispersion estimation and library normalization methods, the details of which can be relevant when doing complex RIP-seq analysis. However, for most applications, the defaults of cuffdiff are perfectly fine, or at least a good place to start. If you look in the cuffdiff.cmds file located in the RIP subdirectory, you will find this:

cd $SCRATCH/ripseq_exercises/RIP/
cat cuffdiff.cmds
 
#Returns:

cuffdiff -p 6 --frag-bias-correct /BowtieIndex/genome.fa -u \
--library-type=fr-unstranded --emit-count-tables --multi-read-correct --min-alignment-count=5 \
--min-reps-for-js-test=1 --dispersion-method=per-condition --library-norm-method=geometric \
-o cuffdiff_out -L IP,INPUT ./genes.gtf \
./ip_sample/accepted_hits.bam ./input_sample/accepted_hits.bam

Option	Parameter	Relevance
- o	Output directory	Directs the (many) output files and directories
- p	Number of threads	Should match wayness setting in any batch jobs
- u	Multi read correct	Identify multiply-mapped reads and modify feature counts
- L	Labels	Labels to attach conditions; if not given, will just be numbers
--frag-bias-correct (also -b)	Fragment bias correction	Adjusts output for sequence-specific library prep bias and requires the FASTA file used in alignment
--min-alignment-count	Reads required to test region	STRONGLY affects runtime, since it determines the number of tests performed by cuffdiff
--min-reps-for-js-test=1	Replicates required to perform tests	Default is 3 replicates per condition, so if you have only 2 replicates, this must be set
--library-type=fr-unstranded	Library type	Most of the time, this is the appropriate setting
--dispersion-method=per-condition	Dispersion estimation method	Dispersions calculated for different conditions separately
--library-norm-method=geometric	Library normalization method	Geometric mean of each library used to adjust for sequencing depth

We're not going to run cuffdiff, since I didn't provide any BAM files. This is because, unlike many other programs, cuffdiff does not reduce the size of the library by removing duplicated reads, because it assumes that increases in total expression level indicate increases in expression level (at least most of the time). Instead, move into the cuffdiff_out directory, and start looking at the output files:

cd $SCRATCH/ripseq_exercises/RIP/cuffdiff_out/
ls -la
less gene_exp.diff

The Cuffdiff manual page describes the content of each of these files, including many that simply record your command parameters or provide absolute abundance estimates (e.g. FPKMs). The most important for immediate analysis, though, is usually the "diff" files, which describes the results of differential expression tests on different sets of features. Here are a few commands that let you see interesting groups of genes and their associated expression and test values:

cut -f 1,10 gene_exp.diff | sort -g -k2,2 | grep -v 'e+308' | less   # This gives a view of genes with the biggest fold change
awk '{if ($14=="yes") print $0}' gene_exp.diff | cut -f 1,10 | sort -g -k2,2 | less   # This gives genes that are called as differentially expressed AND are enriched in the IP relative to the Input, along with their fold changes

But what if you want to know about binding regions, not whole transcripts? And what if you don't have (or want to use) a reference annotation to define the coordinates that are tested between samples?