...

Many types of small RNA have been characterized, and their biological functions are extremely wide-ranging.  The table below describes the different forms and biological functions of some small and/or non-coding RNAs (though small RNAs are, almost by definition, non-coding).

Yao Y, Sun Q. Exploration of small non coding RNAs in wheat (Triticum aestivum L.). Plant Mol Biol. 2012;80(1):67-73.

Clearly, small RNAs carry out many biologically important functions.  They can be studied by sequencing simply by cutting (for example) the 25-50bp range out of a size-selection gel and then following otherwise normal library preparation.  Beyond that, all of these species share certain qualities that allow sequencing data derived from each to be analyzed in a similar fashion.  These qualities can include (but are not limited to):

...

Similarly, RNA-protein interactions are required for an equally diverse set of biological functions, and hundreds of RNA-binding proteins have been identified.  It is frequently interesting to isolate protein-RNA complexes, remove the protein, and sequence the resulting RNA.  The methods involved combine components of RNA-seq, because the underlying molecule is RNA, and chromatin immunoprecipitation (ChIP), because the most common way to isolate a protein-RNA complex is with an antibody raised against a fragment of the protein of interest.  Below is a sample protocol flow for a RIP-seq experiment.

Zhao J, Ohsumi TK, Kung JT, et al. Genome-wide identification of polycomb-associated RNAs by RIP-seq. Mol Cell. 2010;40(6):939-53.

For "normal" RIP-seq, one usually expects to recover full RNA molecules regardless of where on an RNA molecule the protein was bound, since all of it is 'pulled down' together.  However, such protocols generally do not use any chemical or physical means to covalently attach the RNA to the protein, which allows for the possibility that RNA and protein complexes dissociate and re-associate with each other during sample preparation (published papers have claimed this).  Moreover, proteins will often bind to specific RNA sequence motifs or positions, and retrieval of the full RNA molecule provides no information about the specific binding site.  To address these concerns, methods have been developed to cross-link protein to RNA in a way that leaves a signature of interaction where the protein and RNA actually come into contact.  Below is a table of the three methods that modify the RNA in various ways to enable binding site detection by sequencing.

König J, Zarnack K, Luscombe NM, Ule J. Protein-RNA interactions: new genomic technologies and perspectives. Nat Rev Genet. 2011;13(2):77-83.

In our second exercise, we will use a recently developed tool to analyze some sample PAR-CLIP data to identify specific binding sites of a protein across the entire human transcriptome.

...

For standard RIP-seq, many of the methods already covered in this class are useful since one can expect to recover a full RNA molecule, and IP and Mock/No Antibody/IgG samples can be thought of as "conditions" to be compared by differential expression analysis.  If the only comparison is between IP and Input, then the tools you have already learned about can be used to quantify expression for each transcript, and fold changes can subsequently be calculated.  However, more specific tools do exist, particularly for CLIP-seq and its variants.  Below is a table from a semi-recent paper that summarizes some of the most widely used tools in RIP and CLIP experiments.

Li Y, Zhao DY, Greenblatt JF, Zhang Z. RIPSeeker: a statistical package for identifying protein-associated transcripts from RIP-seq experiments. Nucleic Acids Res. 2013;41(8):e94.

Some of these tools, like Cuffdiff (and similar tools like edgeR and DESeq), can be used in standard RIP-seq experiments just as you would use them for normal differential expression analysis.  These programs tend to be available directly at TACC.  Others, like MACS, are available at TACC but are not really designed for use with RNA-seq data.  Finally, programs like RIPSeeker and PARalyzer are much less widely used (since they are much more recent), but are designed for extremely specific experimental structures.  PARalyzer, for example, is intended only for use with PAR-CLIP data.
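For the simplest case described above (IP versus Input only), the fold-change step can be sketched in a few lines of awk.  This is only an illustration under stated assumptions: the function name log2_enrichment, the three-column tab-delimited input layout, and the pseudocount of 1 are all hypothetical choices and are not the output format or default of Cuffdiff or any other tool.

```shell
# Hypothetical helper: log2 fold enrichment of IP over Input per transcript.
# Reads tab-delimited text on stdin with the ASSUMED columns:
#   transcript_id <TAB> IP_expression <TAB> Input_expression
# The optional first argument is a pseudocount (default 1) that avoids
# division by zero for transcripts with no signal in one sample.
log2_enrichment() {
    pseudo="${1:-1}"
    awk -F'\t' -v p="$pseudo" '
        NF >= 3 && $1 !~ /^#/ {
            lfc = log(($2 + p) / ($3 + p)) / log(2)   # natural log -> log2
            printf "%s\t%.3f\n", $1, lfc
        }'
}

# Example with made-up expression values
printf 'geneA\t7\t1\ngeneB\t1\t3\n' | log2_enrichment 1
```

With these made-up values, geneA is enriched in the IP (positive log2 ratio) and geneB is depleted (negative log2 ratio).  A calculation like this carries no statistical test, which is one reason to prefer the dedicated tools when replicates are available.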

...

In the exercises that follow, we will use samtools to generate miRNA profiles (Exercise #1), parse Cuffdiff results to evaluate mRNA enrichment in a 'normal' RIP-seq experiment (Exercise #2), and use PARalyzer to analyze a real (but down-sampled) PAR-CLIP dataset (Exercise #2).

Exercise #1: miRNA Sequencing and Profiling (miRNA-seq)

...

Code Block
module load perl
module load bowtie/2.2.0
cd $SCRATCH/my_rnaseq_course/day_4b/mirbase
bowtie2-build hairpin_cDNA_hsa.fa hairpin_cDNA_hsa.fa

...

To run the alignment, we execute a command that is very similar to BWA or TopHat2, but with different syntax:

Code Block
cd $SCRATCH/my_rnaseq_course/day_4b
bowtie2 --local -N 1 -L 16 -x mirbase/hairpin_cDNA_hsa.fa -U human_mirnaseq.fastq.gz -S human_mirnaseq.sam

What's going on?

Parameters are:

  • --local – local alignment mode
  • -N 1 – allow 1 seed mismatch
  • -L 16 – seed length 16
  • -x  mirbase/hairpin_cDNA_hsa.fa – prefix path of index files
  • -U human_mirnaseq.fastq.gz – FASTQ file for single-end (Unpaired) alignment
  • -S human_mirnaseq.sam – tells bowtie2 to report alignments in SAM format to the specified file
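Once the SAM file exists, a rough miRNA profile is just the number of aligned reads per hairpin reference.  The sketch below does this with awk alone so that it is self-contained; the function name count_reads_per_hairpin is hypothetical, and this is not necessarily how the exercise itself builds the profile (which uses samtools).  It relies only on the SAM layout: header lines start with "@", column 2 is the FLAG, and column 3 is the reference name.

```shell
# Hypothetical helper: count aligned reads per reference sequence.
# Reads SAM text on stdin and prints "reference<TAB>count",
# most abundant reference first.
count_reads_per_hairpin() {
    awk -F'\t' '
        /^@/ { next }                    # skip SAM header lines
        int($2 / 4) % 2 == 1 { next }    # skip unmapped reads (FLAG bit 0x4 set)
        { n[$3]++ }                      # column 3 (RNAME) is the hairpin name
        END { for (r in n) printf "%s\t%d\n", r, n[r] }
    ' | sort -k2,2nr -k1,1
}

# Usage on the alignment produced above (uncomment to run):
# count_reads_per_hairpin < human_mirnaseq.sam | head
```

The FLAG test does the same job as filtering with samtools view -F 4 before counting.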

...

Code Block
cd $SCRATCH/my_rnaseq_course/day_4b
cp -r /corral-repl/utexas/BioITeam/rnaseq_course_2015/day_4b/PARalyzer_v1_5 .
ls -la PARalyzer_v1_5

As you will see, this directory contains many files, including the PARalyzer executable itself.  It also contains the SAM file that we will analyze.  Go ahead and take a look at the SAM file using less.  In particular, below is a sample read that contains the characteristic mutational signal that (in principle) indicates close proximity to a protein during crosslinking:

...

The other two files in the PARalyzer directory that are worth mentioning are "hg19.2bit", which is a special binary form of the hg19 human genome build, and "sample.ini," which contains all specifications for PARalyzer.  In fact, to run PARalyzer, we provide it only with a memory allotment and the name of the sample.ini file.  Go ahead and see what is in sample.ini using less or cat:

Code Block
less PARalyzer_v1_5/sample.ini
 
#BANDWIDTH=3
#CONVERSION=T>C
#MINIMUM_READ_COUNT_PER_GROUP=10
#MINIMUM_READ_COUNT_PER_CLUSTER=1
#MINIMUM_READ_COUNT_FOR_KDE=1
#MINIMUM_CLUSTER_SIZE=1
#MINIMUM_CONVERSION_LOCATIONS_FOR_CLUSTER=2
#MINIMUM_CONVERSION_COUNT_FOR_CLUSTER=1
#MINIMUM_READ_COUNT_FOR_CLUSTER_INCLUSION=1
#MINIMUM_READ_LENGTH=1
#MAXIMUM_NUMBER_OF_NON_CONVERSION_MISMATCHES=5
#SAM_FILE=./Kishore_Ago2_parclip.sam
#GENOME_2BIT_FILE=./hg19.2bit
#OUTPUT_CLUSTERS_FILE=./Kishore_Ago2_clusters.csv
#EXTEND_BY_READ

...

Running this command would produce the output file Kishore_Ago2_clusters.csv, but would take quite a bit of time since the tool does not scale particularly well to increasing read depths.  Consequently, I have already prepared the output file, and named it Kishore_Ago2_clusters_done.csv.  Feel free to try to generate the output at some point, if you like, but in the interests of time, we will proceed to parsing the PARalyzer output to identify interesting interactions.
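One generic way to pick out interesting clusters from a CSV like this is to rank rows by a numeric column chosen by its header name.  The sketch below is a minimal example under stated assumptions: the function name top_clusters is hypothetical, the CSV is assumed to have a single header line and no quoted commas, and the column name used in the usage line (ReadCount) is an assumption that should be checked against the real header with head -n 1.

```shell
# Hypothetical helper: show the header plus the top N rows of a CSV,
# ranked (descending, numeric) by the column whose header matches $2.
# Assumes one header line and no commas inside quoted fields.
top_clusters() {
    csv="$1"; col="$2"; n="${3:-10}"
    # Find the 1-based index of the requested column in the header line
    idx="$(head -n 1 "$csv" | tr ',' '\n' | grep -n -x -F "$col" | head -n 1 | cut -d: -f1)"
    [ -n "$idx" ] || { echo "column not found: $col" >&2; return 1; }
    head -n 1 "$csv"
    tail -n +2 "$csv" | sort -t, -k"${idx},${idx}nr" | head -n "$n"
}

# Usage (the column name is an assumption; uncomment to run):
# top_clusters Kishore_Ago2_clusters_done.csv ReadCount 5
```

Looking the column up by name rather than by position keeps the command readable and fails loudly if the header does not contain the expected name.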

Results

...

Parsing

Use head to look at the Kishore_Ago2_clusters_done.csv file using the following commands:

...

Clearly, there are MANY cases of T>C mutations that (in reality - you can't see this from the above commands alone) are all at the same nucleotide.  If one were to view the reads in IGV or another browser, one would expect to see something like this:

Figure from Hafner M, Lianoglou S, Tuschl T, Betel D. Genome-wide identification of miRNA targets by PAR-CLIP. Methods. 2012;58(2):94-105.

Data from Lipchina I, Elkabetz Y, Hafner M, et al. Genome-wide identification of microRNA targets in human ES cells reveals a role for miR-302 in modulating BMP response. Genes Dev. 2011;25(20):2173-86.

In the figure, red represents reference bases and yellow represents T>C mutation events.  miRNA binding sites are shown because this PAR-CLIP experiment targeted Ago2, so the regions retrieved were primarily miRNA binding sites.
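To make the T>C signal concrete, the toy function below counts positions where a reference sequence has T and the read has C.  It is only a sketch: the name count_t_to_c is hypothetical, the two sequences are made up, and it assumes equal-length, gap-free sequences on the same strand.  Real alignments require the CIGAR string and mismatch information to be handled, and a read aligned to the reverse strand will typically show the same event as A>G in the SAM file.

```shell
# Toy illustration: count T>C conversions between a reference and a read.
# Assumes two equal-length, ungapped sequences on the same strand.
count_t_to_c() {
    awk -v ref="$1" -v read="$2" 'BEGIN {
        if (length(ref) != length(read)) {
            print "length mismatch" > "/dev/stderr"
            exit 1
        }
        n = 0
        for (i = 1; i <= length(ref); i++)
            if (toupper(substr(ref, i, 1)) == "T" && toupper(substr(read, i, 1)) == "C")
                n++
        print n
    }'
}

# Made-up example: two T positions in the reference are read as C
count_t_to_c TGAGGTAGTAGGTTGT TGAGGCAGTAGGTCGT
```

PARalyzer's own model is more involved than a simple count (its configuration includes kernel density settings such as BANDWIDTH), so this shows only the raw signal it starts from.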