Differential expression with splice variant analysis
NOTE: DO this exercise on your local computer, not Lonestar
Differential expression with splice variant analysis at the same time: the Tophat/Cufflinks workflow
In this lab, you will explore a fairly typical RNA-seq analysis workflow. Simulated RNA-seq data will be provided to you; the data contains 50 bp single-end (not paired-end) reads that originated from real human genes, but which include both simulated mutations (SNPs and short indels) and sequencing errors. We have simulated two biological groups with three biological replicates per group (6 samples total). The genetic background of each replicate within a group is identical, but there may be different mutations between the two groups. Your job is to determine what genes, if any, are significantly differentially expressed, to see if there is evidence of alternative splicing between the two samples, and if there is alternative splicing, to determine if that gene has some known disease association.
Your workflow would be:
a) map reads against the human hg19 reference and infer splice junctions (using tophat) (this has already been done to save time)
b) assemble the putative transcripts (using cufflinks) (this has already been done to save time)
c) view the putative transcripts using the UCSC genome browser and/or IGV
d) compute differential expression (using cuffdiff) and find genes and isoforms significantly differentially expressed (by parsing output)
e) examine some differential splicing
f) looking up the gene that shows differential splicing in the UCSC browser and determine what tissue these RNA samples probably came from.
You will almost certainly need to consult the documentation for tophat and cufflinks:
http://tophat.cbcb.umd.edu/manual.html
http://cufflinks.cbcb.umd.edu/manual.html
Six raw data files were provided as the starting point: mut1.0, mut1.1, mut1.2 from the first biological group, and mut2.0, mut2.1, and mut2.2 from the second.
Due to the size of the data and length of run time, *all the programs have already been run for this exercise*. Shell scripts and their log files are provided so you can see the actual commands used and run output. You will be parsing the output, finding answers, and visualizing results.
Download the data: http://web.corral.tacc.utexas.edu/GSAF/lab9.tar.gz, unzip and untar it on your local computer, not Lonestar:
wget http://web.corral.tacc.utexas.edu/GSAF/lab9.tar.gz gunzip lab9.tar.gz tar -xvf lab9.tar cd tophat
Find and inspect your data: 35 points
The file "tophat.sh" was used to run tophat and cufflinks on the input fasta files. The log files in the main tophat directory contain the output from the respective tophat runs, while the file "tophat.sh.log" contains the output from all the cufflinks runs. Use this information to figure out where the output you need has wound up.
Before you even start, do a sanity check on your data by examining the bam files from the mapping output.
I've included the directory "bwa_genome" containing the output from bwa of the mut1.1.fasta file mapped directly to the genome using bwa. You might want to load this into IGV alongside the tophat/bowtie bam file for the same sample to see the differences between mapping to the transcriptome and mapping to the genome (like you lose a bunch of reads).
Report the total number of input reads and the total number of mapped reads for each sample using "samtools flagstat". For mut1.1, compare the percentage mapped to the genome to the percentage mapped via tophat/bowtie to the genome.
For c) figure out where the output .bed and .bam files are for each of the input fasta files.
Once you've figured that out, load all 6 junction.bed files into IGV (you might want to change the names of the files so you can remember which is which). From here, you can zoom in to a region where there is data (look for the peaks in your .bed file input).
Now, load ONE of the bam file outputs from the tophat run (remember, tophat calls bowtie for mapping). These are large files; you can load more than one, just be prepared that it may slow your computer down. If you want to load two, I'd suggest loading one from mut1 and one from mut2 so you see genuine expression level changes.
Differential expression analysis: 50 points
For d), see the file "cuffdiff.sh" and "cuffdiff.sh.log" for the input, command line, and output. Then, find the cuffdiff output (either by understanding cuffdiff.sh or by looking) and by looking at it and/or reading the documentation find the isoforms and genes that are differentially expressed. Note that cuffdiff has performed a statistical test on the expression values between our two biological groups. It reports the FPKM expression levels for each group, the log2(group 1 FPKM/ group 2 FPKM), and a p-value measure of statistical confidence, among many other helpful
data items.
25 points: Find the two genes that are differentially expressed at the gene level, but not at the isoform level. This can be done with a python or perl script, or with a one-line linux command.
Bonus points: explain why there is a difference.
Here is a basic command useful for parsing/sorting the gene_exp.diff
or isoform_exp.diff
files:
cat isoform_exp.diff | awk '{print $10 "\t" $4}' | sort -n -r | head
Note that you can sort by a different column, like 12 (p-value) instead of 10; sort
can also sort on more than one column.
Now go to the GABBR1 gene in either IGV or UCSC browsers. Hint: you can type a gene name directly into the coordinate box on either browser and if it can find a match it will take you there.
25 points: Identify the two exons of GABBR1 that appear to not be expressed in at least some isoforms of mut1.
Data Interpretation: 15 points
For f), go to USCS and load the OMIM Genes track, then use the browser to link to the OMIM entry related to the GABBR1 gene.
For 15 points, read the OMIM entry sufficiently to report the likely organ we extracted RNA from to generate this data set.
Welcome to the University Wiki Service! Please use your IID (yourEID@eid.utexas.edu) when prompted for your email address during login or click here to enter your EID. If you are experiencing any issues loading content on pages, please try these steps to clear your browser cache.