...
For standard RIP-seq, many of the methods already covered in this class are useful since one can expect to recover a full RNA molecule, and the IP and Input samples can be thought of as "conditions" to be compared by differential expression analysis. However, more specific tools do exist, particularly for CLIP-seq and its variants. Below is a table from a semi-recent paper that summarizes some of the most widely used tools in RIP and CLIP experiments.
Some of these tools, like Cuffdiff (and similar tools like edgeR and DESeq), can be used in standard RIP-seq experiments just as you would use them for ordinary differential expression analysis. These programs tend to be available directly at TACC. Others, like MACS, are available at TACC but are not really designed for use with RNA-seq data. Finally, programs like RIPseeker and PARalyzer are much less widely used (since they are much more recent), but are designed for very specific experimental designs. PARalyzer, for example, is intended exclusively for PAR-CLIP data.
In the exercises that follow, we will use samtools to generate miRNA profiles (Exercise #1), parse Cuffdiff results to evaluate mRNA enrichment in a 'normal' RIP-seq experiment (Exercise #2), and run PARalyzer to analyze a down-sampled toy PAR-CLIP dataset.
Exercise #1: miRNA/small RNA Sequencing and Profiling (miRNA-seq)
I have downloaded and set up a sample microRNA-seq dataset derived from H1 human embryonic stem cells, generated for the ENCODE project. These are 1x36 (single-end, 36-base) Illumina reads. We will first copy them over from the BioITeam area on Corral, stage them in a directory in your scratch area, and take a quick look at them. The commands to do that would look something like this:
| Code Block |
|---|
mkdir -p $SCRATCH/my_rnaseq_course/day_4b
cd $SCRATCH/my_rnaseq_course/day_4b
cp /corral-repl/utexas/BioITeam/rnaseq_course_2015/day_4b/human_mirnaseq.fastq.gz .
less human_mirnaseq.fastq.gz |
A sample miRNA FASTQ entry, viewed with less, might look like this:
| Code Block |
|---|
@TUPAC_0037_FC62EE7AAXX:2:1:2000:1139#0/1
TAGCAGCACGTCAGTATTGNCGTAAAAAAAAAAAAG
+TUPAC_0037_FC62EE7AAXX:2:1:2000:1139#0/1
ffafffffff\U_La[[W[B^a^abfffcccccccc |
The third line repeats the read name after the "+", which is an artifact of a storage method that we won't go into here. Otherwise the entry has the usual four-line layout: read name, sequence, the "+" separator line, and quality scores. Note the string of A's towards the end of the sequence. As with many very short RNAs, the read extends past the actual RNA fragment and into the adapter. In this case the 'adapter' sequence is obvious: it is just a poly-A string. But what if it weren't? When working with publicly available small RNA data, you will often not know what the adapter is (it may not be obvious), and, if we're being honest, you might not even know it for data coming from your own lab.
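One rough way to get a hint about an unknown 3' adapter is to tally what the reads end with. The sketch below is only an illustration, not part of the course materials: the helper names `tail_kmers` and `count_polya` are hypothetical, and the defaults (the last 8 bases, the first 100000 reads, a run of at least 8 A's) are arbitrary assumptions. It uses only standard tools (gzip, awk, sort, uniq, grep).

```shell
# Hypothetical helper: tally the last K bases of each read in a gzipped FASTQ
# and print the ten most common endings with their counts.
# Usage: tail_kmers FILE.fastq.gz [K] [MAX_READS]
tail_kmers() {
    local fq="$1" k="${2:-8}" n="${3:-100000}"
    gzip -dc "$fq" \
      | awk -v k="$k" -v n="$n" '
            # Sequence lines are the 2nd line of every 4-line FASTQ record.
            NR % 4 == 2 {
                print substr($0, length($0) - k + 1)
                if (++seen >= n) exit    # stop after MAX_READS reads
            }' \
      | sort | uniq -c | sort -k1,1nr | head -n 10
}

# Hypothetical helper: count reads containing a run of at least MIN A's.
# Usage: count_polya FILE.fastq.gz [MIN]
count_polya() {
    local fq="$1" min="${2:-8}"
    gzip -dc "$fq" | awk 'NR % 4 == 2' | grep -cE "A{${min},}"
}
```

For the file staged above, this could be run as `tail_kmers human_mirnaseq.fastq.gz 8` followed by `count_polya human_mirnaseq.fastq.gz 8`. If one ending dominates the tally, it is a reasonable candidate for adapter sequence, though short inserts and abundant miRNAs can also produce over-represented endings, so treat the result as a hint rather than a definitive answer.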
Exercise #2: Ribonucleoprotein Immunoprecipitation and Sequencing (RIP-seq)
...