All these steps have already been run. We'll be spending time looking at the commands and output. Let's get set up.
Get to the results
cds cd my_rnaseq_course #If you've already copied day_3 stuff over, ignore next line cp -r /corral-repl/utexas/BioITeam/rnaseq_course_2015/day_3/cufflinks_results . cd cufflinks_results
Step 1: Tophat
We've already gone over how tophat results look here so let's move on to step 2.
Step 2: Cufflinks
HOW WAS IT RUN?
#If you've copied it over successfully: cat run_commands/commands.cufflinks #If you don't have a local copy, you can read it from the source: cat /corral-repl/utexas/BioITeam/rnaseq_course_2015/day_3/cufflinks_results/run_commands/commands.cufflinks
HOW DOES THE OUTPUT LOOK?
Take a look at output for one of our samples, C1_R1. The important file is transcripts.gtf, which contains Tophat's assembled junctions for C1_R1.
#If you have a local copy: ls -l C1_R1_clout #If you don't have a local copy: ls -l /corral-repl/utexas/BioITeam/rnaseq_course_2015/day_3/cufflinks_results/C1_R1_clout -rw------- 1 daras G-801020 627673 May 17 16:58 genes.fpkm_tracking -rw------- 1 daras G-801020 1021025 May 17 16:58 isoforms.fpkm_tracking -rw------- 1 daras G-801020 0 May 17 16:50 skipped.gtf -rw------- 1 daras G-801020 14784740 May 17 16:58 transcripts.gtf
DESCRIPTION OF TRANSCRIPTS.GTF FILE
Each row corresponds to one transcript or an exon of a transcript. First 7 columns are standard gff/gtf columns, followed by some attributes added by cufflinks.
Column number | Column name | Example | Description |
1 | seqname | chrX | Chromosome or contig name |
2 | source | Cufflinks | The name of the program that generated this file (always 'Cufflinks') |
3 | feature | exon | The type of record (always either "transcript" or "exon".) |
4 | start | 77696957 | The leftmost coordinate of this record. |
5 | end | 77712009 | The rightmost coordinate of this record, inclusive. |
6 | score | 77712009 | Score that tells you how abundant this isoform is compared to other isoforms of the gene |
7 | strand | + | Cufflinks' guess for which strand the isoform came from. Always one of "+", "-", "." |
7 | frame | . | Not used |
8 | attributes | ... | See below. |
- Each GTF record is decorated with the following attributes:
Attribute | Example | Description |
gene_id | CUFF.1 | Cufflinks gene id |
transcript_id | CUFF.1.1 | Cufflinks transcript id |
FPKM | 101.267 | Isoform-level relative abundance in FPKM |
frac | 0.7647 | Reserved. Please ignore. |
cov | 100.765 | Estimate for the absolute depth of read coverage across the whole transcript |
full_read_support | yes | When RABT assembly is used, this attribute reports whether or not all introns and internal exons were fully covered by reads from the data. |
Exercise: Find out how many entries in the transcripts.gtf file are from novel transcripts and how many from annotated transcripts.
The secret lies in the gene_id column. #For counting novel entries grep 'CUFF' C1_R1_clout/transcripts.gtf |wc -l 54847 #For counting entries corresponding to annotated genes grep -v 'CUFF' C1_R1_clout/transcripts.gtf |wc -l 88644 What do you think grep -v does?
Step 3: Cuffmerge
HOW WAS IT RUN?
We first created a file listing the paths of all per-sample transcripts.gtf files so far, then pass that to cuffmerge:
find . -name transcripts.gtf > assembly_list.txt #If you have a local copy: cat assembly_list.txt
cat run_commands/commands.cuffmerge cuffmerge -g reference/genes.exons.gtf assembly_list.txt
HOW DOES THE OUTPUT LOOK?
The most important file is merged.gif, which contains the consensus transcriptome annotations cuffmerge has calculated. This is the input for the next step.
#If you have a local copy: ls -l merged_asm drwx------ 2 daras G-801020 32768 May 28 05:36 logs -rw------- 1 daras G-801020 29449035 May 28 05:36 merged.gtf
The cuffmerge result also has information about how this transcript compares to annotated ones in your gtf file.
CLASS CODES
Priority | Code | Description | |
1 | = | Complete match of intron chain | |
2 | c | Contained | |
3 | j | Potentially novel isoform (fragment): at least one splice junction is shared with a reference transcript | |
4 | e | Single exon transfrag overlapping a reference exon and at least 10 bp of a reference intron, indicating a possible pre-mRNA fragment. | |
5 | i | A transfrag falling entirely within a reference intron | |
6 | o | Generic exonic overlap with a reference transcript | |
7 | p | Possible polymerase run-on fragment (within 2Kbases of a reference transcript) | |
8 | r | Repeat. Currently determined by looking at the soft-masked reference sequence and applied to transcripts where at least 50% of the bases are lower case | |
9 | u | Unknown, intergenic transcript | |
10 | x | Exonic overlap with reference on the opposite strand | |
11 | s | An intron of the transfrag overlaps a reference intron on the opposite strand (likely due to read mapping errors) |
Exercise 1: Count how many potentially novel fragments we have in our data
grep 'class_code "j"' merged_asm/merged.gtf |wc -l 17441
Step 4: Cuffquant
HOW WAS IT RUN?
cat run_commands/commands.cuffquant
HOW DOES THE OUTPUT LOOK?
It makes one file per sample- abuandances.cxb. It's a binary file containing the quantification results for each sample, that are now ready for cuffdiff.
#If you have a local copy: ls -l cuffquant_C1_R1_out/ -rw------- 1 daras G-801020 18373486 May 28 05:24 abundances.cxb
Step 5: Cuffdiff
HOW WAS IT RUN?
#If you have a local copy: cat run_commands/commands.cuffdiff
#If you have a local copy: ls -l diff_out -rwxr-x--- 1 daras G-801020 2691192 Aug 21 12:20 isoform_exp.diff : Differential expression testing for transcripts -rwxr-x--- 1 daras G-801020 1483520 Aug 21 12:20 gene_exp.diff : Differential expression testing for genes -rwxr-x--- 1 daras G-801020 1729831 Aug 21 12:20 tss_group_exp.diff: Differential expression testing for primary transcripts -rwxr-x--- 1 daras G-801020 1369451 Aug 21 12:20 cds_exp.diff : Differential expression testing for coding sequences -rwxr-x--- 1 daras G-801020 3277177 Aug 21 12:20 isoforms.fpkm_tracking -rwxr-x--- 1 daras G-801020 1628659 Aug 21 12:20 genes.fpkm_tracking -rwxr-x--- 1 daras G-801020 1885773 Aug 21 12:20 tss_groups.fpkm_tracking -rwxr-x--- 1 daras G-801020 1477492 Aug 21 12:20 cds.fpkm_tracking -rwxr-x--- 1 daras G-801020 1349574 Aug 21 12:20 splicing.diff : Differential splicing tests -rwxr-x--- 1 daras G-801020 1158560 Aug 21 12:20 promoters.diff : Differential promoter usage -rwxr-x--- 1 daras G-801020 919690 Aug 21 12:20 cds.diff : Differential coding output
PARSING CUFFDIFF OUTPUT
Exercise 3: What is the status column in our gene_exp.diff files? and what values does it have?
cat diff_out/gene_exp.diff|cut -f 7|sort|uniq
Here is a basic command useful for parsing/sorting the gene_exp.diff
or isoform_exp.diff
files:
cat diff_out/isoform_exp.diff | awk '{print $10 "\t" $4}' | sort -n -r | head
Exercise 3: Find the 10 most up-regulated genes, by fold change that are classified as significantly changed.
cat diff_out/gene_exp.diff |grep 'yes'|sort -n -r -k10,10|head #Lets pull out just gene names cat diff_out/gene_exp.diff |grep 'yes'|sort -n -r -k10,10|head|cut -f 3
Exercise 4: Find the 10 most down-regulated genes, by fold change that are classified as significantly changed.
cat diff_out/gene_exp.diff |grep 'yes'|sort -k10n,10|head|cut -f 3
Exercise 5: Find the 10 most up-regulated isoforms, by fold change that are classified as significantly changed. What genes do they belong to?
cat diff_out/isoform_exp.diff |grep 'yes'|sort -k10nr,10|head #To pull out gene names: cat diff_out/isoform_exp.diff |grep 'yes'|sort -k10nr,10|head |cut -f 3
Exercise 6: Pull out and count the significantly differentially expressed genes do we have (qvalue <=0.05 and abs fold change > 2 or log2 fold change > 1)?
cat diff_out/gene_exp.diff |grep 'yes' |awk '{if (($10 >= 1)||($10<=1)) print }' > DEG wc -l DEG