...
For "normal" RIP-seq, one usually expects to recover full RNA molecules regardless of where on the RNA the protein was bound, since the whole molecule is 'pulled down' together. However, such protocols generally do not use any chemical or physical means to covalently attach the RNA to the protein, which leaves open the possibility that RNA-protein complexes dissociate and re-associate during sample preparation (published papers have claimed this - see here). Moreover, proteins often bind specific RNA sequence motifs or positions, and retrieving the full RNA molecule provides no information about the specific binding site. To address these concerns, methods have been developed to cross-link protein to RNA in a way that leaves a signature of interaction where the protein and RNA actually come into contact. Below is a table of the three methods that modify the RNA in various ways to enable binding site detection by sequencing.
In our second exercise, we will use a recently developed tool to analyze some sample PARCLIP data to identify specific binding sites of a protein across the entire human transcriptome.
Important Software
For standard RIP-seq, many of the methods already covered in this class are useful, since one can expect to recover a full RNA molecule, and the IP and Input Mock/No Antibody/IgG samples can be thought of as "conditions" to be compared by differential expression analysis. If the only comparison is between IP and Input, then the tools you have already learned about can be used to quantitate expression for each transcript, and fold changes can subsequently be calculated. However, more specific tools do exist, particularly for CLIP-seq and its variants. Below is a table from a semi-recent paper that summarizes some of the most widely used tools in RIP and CLIP experiments.
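As a rough illustration of the IP-versus-Input comparison described above, the sketch below computes a log2 fold change per transcript from two sets of expression values. The transcript names, the values, and the pseudocount of 1 are all assumptions made for the example; they are not taken from the course data, and a real analysis would rely on a differential expression tool rather than this bare ratio.

```python
import math


def log2_fold_changes(ip, input_, pseudocount=1.0):
    """Return {transcript: log2((IP + pc) / (Input + pc))}.

    Transcripts missing from one sample are treated as having expression 0.
    The pseudocount (an assumption of this sketch) avoids division by zero.
    """
    result = {}
    for transcript in sorted(set(ip) | set(input_)):
        ip_value = ip.get(transcript, 0.0)
        input_value = input_.get(transcript, 0.0)
        ratio = (ip_value + pseudocount) / (input_value + pseudocount)
        result[transcript] = math.log2(ratio)
    return result


# Hypothetical per-transcript expression values (e.g. FPKM).
ip_expression = {"geneA": 31.0, "geneB": 3.0, "geneC": 0.0}
input_expression = {"geneA": 3.0, "geneB": 3.0, "geneC": 7.0}

fold_changes = log2_fold_changes(ip_expression, input_expression)
for transcript, lfc in fold_changes.items():
    print(f"{transcript}\t{lfc:+.2f}")
```

Positive values indicate enrichment in the IP relative to Input, and negative values indicate depletion.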
...
In the exercises that follow, we will use samtools to generate miRNA profiles (Exercise #1), parse Cuffdiff results to evaluate mRNA enrichment in a 'normal' RIP-seq experiment (Exercise #2), and implement PARalyzer to analyze a down-sampled toy PARCLIP dataset.
Exercise #1: miRNA
...
Sequencing and Profiling (miRNA-seq)
In this exercise, we will analyze a sample microRNA-seq dataset derived from H1 human embryonic stem cells that was generated for the ENCODE project and made publicly available a few years ago. These are 1x36 Illumina reads derived from all cellular RNA shorter than 200 bp. Our end goal will be to obtain a microRNA profile, that is, counts of how many reads are derived from each microRNA.
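To make the idea of a microRNA profile concrete, here is a minimal sketch that counts reads per reference sequence from SAM-formatted text. It assumes the reads have already been aligned against a miRNA reference and that the alignments are available as SAM lines; the read names and reference names below are invented for illustration. It relies only on the standard SAM layout (FLAG in column 2, RNAME in column 3) and skips unmapped, secondary, and supplementary records so that each read is counted once.

```python
from collections import Counter

# SAM FLAG bits: unmapped (0x4), secondary (0x100), supplementary (0x800).
SKIP_FLAGS = 0x4 | 0x100 | 0x800


def count_reads_per_reference(sam_lines):
    """Count primary mapped alignments per reference name (SAM column 3)."""
    counts = Counter()
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and header lines
        fields = line.rstrip("\n").split("\t")
        flag, reference = int(fields[1]), fields[2]
        if flag & SKIP_FLAGS or reference == "*":
            continue
        counts[reference] += 1
    return counts


def toy_sam_line(read_name, flag, reference):
    """Build a made-up SAM record with the 11 mandatory fields."""
    seq = "ACGTACGTACGTACGTACGTAC"
    mapped = reference != "*"
    fields = [read_name, str(flag), reference, "1" if mapped else "0",
              "30" if mapped else "0", f"{len(seq)}M" if mapped else "*",
              "*", "0", "0", seq, "I" * len(seq)]
    return "\t".join(fields)


# Hypothetical alignments; "toy-mir-1" and "toy-mir-2" are not real miRNAs.
toy_sam = [
    "@SQ\tSN:toy-mir-1\tLN:22",
    toy_sam_line("read1", 0, "toy-mir-1"),
    toy_sam_line("read2", 16, "toy-mir-1"),   # reverse-strand alignment
    toy_sam_line("read3", 0, "toy-mir-2"),
    toy_sam_line("read4", 4, "*"),            # unmapped read
    toy_sam_line("read3", 256, "toy-mir-1"),  # secondary alignment, ignored
]

profile = count_reads_per_reference(toy_sam)
for reference, n_reads in profile.most_common():
    print(f"{reference}\t{n_reads}")
```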
Reference Building
Recall that, because these RNAs are very short, they may align multiple times throughout the genome. Moreover, our goal (as is frequently the case) is to quantitate all known small RNAs of a given class, rather than discover new members. Thus, it makes sense to align our sequences against a database of miRNA sequences (or snRNA, or tRNA, or...) where identical sequences are collapsed. As we will see, this facilitates downstream analysis and is also significantly faster, since the search space is dramatically smaller than the whole genome.
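The collapsing step mentioned above can be sketched in a few lines. This is only an illustration of the idea, not the procedure used to build the course reference: the FASTA records are toy sequences with invented names, and joining the names of identical entries with "|" is an arbitrary choice made for the example.

```python
def read_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            header = line[1:].split()
            name, chunks = (header[0] if header else ""), []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def collapse_identical(records):
    """Map each distinct sequence to the list of names that share it."""
    names_by_sequence = {}
    for name, sequence in records:
        names_by_sequence.setdefault(sequence.upper(), []).append(name)
    return names_by_sequence


# Toy input: two loci producing the same mature sequence, plus one other.
toy_fasta = """\
>toy-mir-1a locus one
ACGTTGCAAGGCTTAACGGATC
>toy-mir-1b locus two
ACGTTGCAAGGCTTAACGGATC
>toy-mir-2
TTGACCGGATAGCATCGGAATC
"""

collapsed = collapse_identical(read_fasta(toy_fasta.splitlines()))
for sequence, names in collapsed.items():
    print(">" + "|".join(names))
    print(sequence)
```

With identical entries merged, a read can no longer map equally well to several copies of the same sequence, which is what makes the later counting step straightforward.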
To obtain a FASTA file with all human miRNA sequences, execute these commands:
| Code Block |
|---|
mkdir -p $SCRATCH/my_rnaseq_course/day_4b
cd $SCRATCH/my_rnaseq_course/day_4b
cp /corral-repl/utexas/BioITeam/rnaseq_course_2015/day_4b/human_mirnaseq.fastq.gz .
less human_mirnaseq.fastq.gz |
Data Staging
We will first copy them over from the BioITeam area on Corral, stage them in a directory in your scratch area, and look at them a little bit. The commands to do that would look something like this:
...
So, we would like to use an alignment strategy that can intelligently ignore the parts that won't align to a reference (the 'adapter') and correctly align the parts that do. This is called a 'local' alignment, in contrast to a 'global' alignment, which would count the 'mismatches' in the adapter against the alignment score. Fortunately, you have already used a local-alignment-capable aligner in this class. Tophat2 runs on the Bowtie2 alignment engine, which (if used directly, i.e. not through Tophat2) can perform local alignment. So, that won't be a problem. But wait! The other major issue here is that a given miRNA sequence may occur many times in the genome, and each locus will produce an identical mature miRNA sequence.
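The difference between the two alignment modes described above can be illustrated with a small dynamic-programming sketch. This is a textbook-style scoring routine written for this example, not Bowtie2's algorithm: the match, mismatch, and gap scores are illustrative values rather than any aligner's defaults, and both the miRNA-like sequence and the adapter are made up. In local mode, scores are floored at zero and the best cell anywhere is reported; in the end-to-end mode used here as the stand-in for 'global', every base of the read must be accounted for.

```python
def alignment_score(read, ref, local, match=2, mismatch=-3, gap=-5):
    """Best alignment score of `read` against `ref` (score only, no traceback).

    local=True  : Smith-Waterman style; unaligned read ends cost nothing.
    local=False : every read base must be aligned or gapped; the alignment
                  may start and end anywhere in the reference.
    """
    prev = [0] * (len(ref) + 1)
    best = 0
    for i in range(1, len(read) + 1):
        cur = [0 if local else i * gap]
        for j in range(1, len(ref) + 1):
            step = match if read[i - 1] == ref[j - 1] else mismatch
            score = max(prev[j - 1] + step, prev[j] + gap, cur[j - 1] + gap)
            if local:
                score = max(score, 0)
                best = max(best, score)
            cur.append(score)
        prev = cur
    return best if local else max(prev)


# Hypothetical sequences: a 22-nt miRNA-like insert followed by adapter bases.
toy_mirna = "TGAGGTAGTAGGTTGTATAGTT"
toy_adapter = "TCGTATGCCGTCTT"
toy_read = toy_mirna + toy_adapter

local_result = alignment_score(toy_read, toy_mirna, local=True)
end_to_end_result = alignment_score(toy_read, toy_mirna, local=False)
print("local score:     ", local_result)
print("end-to-end score:", end_to_end_result)
```

In this toy case the local score rewards only the matching insert, while the end-to-end score is dragged down by the adapter bases that have nowhere to align.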
Exercise
...
#3: Ribonucleoprotein Immunoprecipitation and Sequencing (RIP-seq)
...
...