All these steps have already been run. We'll be spending time looking at the commands and output. Let's get set up.
cds cd my_rnaseq_course cp -r /corral-repl/utexas/BioITeam/rnaseq_course/cufflinks_results . cd cufflinks_results |
We've already gone over how tophat results look here so let's move on to step 2.
HOW WAS IT RUN?
#If you've copied it over successfully: cat run_commands/commands.cufflinks #If you don't have a local copy, you can read it from the source: cat /corral-repl/utexas/BioITeam/rnaseq_course/cufflinks_results/run_commands/commands.cufflinks |
HOW DOES THE OUTPUT LOOK?
Take a look at output for one of our samples, C1_R1. The important file is transcripts.gtf, which contains Tophat's assembled junctions for C1_R1.
#If you have a local copy: ls -l C1_R1_clout #If you don't have a local copy: ls -l /corral-repl/utexas/BioITeam/rnaseq_course/cufflinks_results/C1_R1_clout -rw------- 1 daras G-801020 627673 May 17 16:58 genes.fpkm_tracking -rw------- 1 daras G-801020 1021025 May 17 16:58 isoforms.fpkm_tracking -rw------- 1 daras G-801020 0 May 17 16:50 skipped.gtf -rw------- 1 daras G-801020 14784740 May 17 16:58 transcripts.gtf |
DESCRIPTION OF TRANSCRIPTS.GTF FILE
Each row corresponds to one transcript or an exon of a transcript. First 7 columns are standard gff/gtf columns, followed by some attributes added by cufflinks.
| Column number | Column name | Example | Description |
| 1 | seqname | chrX | Chromosome or contig name |
| 2 | source | Cufflinks | The name of the program that generated this file (always 'Cufflinks') |
| 3 | feature | exon | The type of record (always either "transcript" or "exon".) |
| 4 | start | 77696957 | The leftmost coordinate of this record. |
| 5 | end | 77712009 | The rightmost coordinate of this record, inclusive. |
| 6 | score | 77712009 | Score that tells you how abundant this isoform is compared to other isoforms of the gene |
| 7 | strand | + | Cufflinks' guess for which strand the isoform came from. Always one of "+", "-", "." |
| 7 | frame | . | Not used |
| 8 | attributes | ... | See below. |
| Attribute | Example | Description |
| gene_id | CUFF.1 | Cufflinks gene id |
| transcript_id | CUFF.1.1 | Cufflinks transcript id |
| FPKM | 101.267 | Isoform-level relative abundance in FPKM |
| frac | 0.7647 | Reserved. Please ignore. |
| cov | 100.765 | Estimate for the absolute depth of read coverage across the whole transcript |
| full_read_support | yes | When RABT assembly is used, this attribute reports whether or not all introns and internal exons were fully covered by reads from the data. |
Exercise: Find out how many entries in the transcripts.gtf file are from novel transcripts and how many from annotated transcripts.
The secret lies in the gene_id column. #For counting novel entries grep 'CUFF' C1_R1_clout/transcripts.gtf |wc -l 54847 #For counting entries corresponding to annotated genes grep -v 'CUFF' C1_R1_clout/transcripts.gtf |wc -l 88644 What do you think grep -v does? |
HOW WAS IT RUN?
cat run_commands/commands.cuffcompare |
HOW DOES THE OUTPUT LOOK?
cuffcmp.tracking is the most important file to pay attention to. It tells you how the assembled transcript are related to known transcripts (from genes.gtf file)
#If you have a local copy: ls -l cmp_out -rw-rw-r-- 1 daras G-801020 26041012 May 21 01:34 cuffcmp.combined.gtf -rw-rw-r-- 1 daras G-801020 2544325 May 21 01:34 cuffcmp.loci -rw-rw-r-- 1 daras G-801020 6234 May 21 01:34 cuffcmp.stats -rw-rw-r-- 1 daras G-801020 10816986 May 21 01:34 cuffcmp.tracking |
DESCRIPTION OF *TRACKING FILE CLASS CODES
| Column name | Example | Description | |
| 1 | Cufflinks transfrag id | TCONS_00000045 | A unique internal id for the transfrag |
| 2 | Cufflinks locus id | XLOC_000023 | A unique internal id for the locus |
| 3 | Reference gene id | Tcea | The gene_name attribute of the reference GTF record for this transcript, or '-' if no reference transcript overlaps this Cufflinks transcript |
| 4 | Reference transcript id | uc007afj.1 | The transcript_id attribute of the reference GTF record for this transcript, or '-' if no reference transcript overlaps this Cufflinks transcript |
| 5 | Class code | c | The type of match between the Cufflinks transcripts in column 6 and the reference transcript. See below for class code info |
CLASS CODES
| Priority | Code | Description | |
| 1 | = | Complete match of intron chain | |
| 2 | c | Contained | |
| 3 | j | Potentially novel isoform (fragment): at least one splice junction is shared with a reference transcript | |
| 4 | e | Single exon transfrag overlapping a reference exon and at least 10 bp of a reference intron, indicating a possible pre-mRNA fragment. | |
| 5 | i | A transfrag falling entirely within a reference intron | |
| 6 | o | Generic exonic overlap with a reference transcript | |
| 7 | p | Possible polymerase run-on fragment (within 2Kbases of a reference transcript) | |
| 8 | r | Repeat. Currently determined by looking at the soft-masked reference sequence and applied to transcripts where at least 50% of the bases are lower case | |
| 9 | u | Unknown, intergenic transcript | |
| 10 | x | Exonic overlap with reference on the opposite strand | |
| 11 | s | An intron of the transfrag overlaps a reference intron on the opposite strand (likely due to read mapping errors) |
Exercise 2: Count how many potentially novel fragments we have in our data
cat cmp_out/cuffcmp.tracking |awk '{if ($4 == "j")print}'|wc -l
556
Why did we choose awk here instead of grep? |
HOW WAS IT RUN?
We first created a file listing the paths of all per-sample transcripts.gtf files so far, then pass that to cuffmerge:
find . -name transcripts.gtf > assembly_list.txt #If you have a local copy: cat assembly_list.txt |
cat run_commands/commands.cuffmerge
cuffmerge -g reference/genes.exons.gtf assembly_list.txt |
HOW DOES THE OUTPUT LOOK?
The most important file is merged.gif, which contains the consensus transcriptome annotations cuffmerge has calculated. This is the input for the next step.
#If you have a local copy: ls -l merged_asm -rwxrwxr-x 1 daras G-803889 1571816 Aug 16 2012 genes.fpkm_tracking -rwxrwxr-x 1 daras G-803889 2281319 Aug 16 2012 isoforms.fpkm_tracking drwxrwxr-x 2 daras G-803889 32768 Aug 16 2012 logs -r-xrwxr-x 1 daras G-803889 32090408 Aug 16 2012 merged.gtf -rwxrwxr-x 1 daras G-803889 0 Aug 16 2012 skipped.gtf drwxrwxr-x 2 daras G-803889 32768 Aug 16 2012 tmp -rwxrwxr-x 1 daras G-803889 34844830 Aug 16 2012 transcripts.gtf |
HOW WAS IT RUN?
#If you have a local copy: cuffdiff -o diff_out -b reference/genome.fa -p 8 -L C1,C2 -u merged_asm/merged.gtf C1_R1_thout/accepted_hits.bam,C1_R2_thout/accepted_hits.bam,C1_R3_thout/accepted_hits.bam C2_R1_thout/accepted_hits.bam,C2_R2_thout/accepted_hits.bam,C2_R3_thout/accepted_hits.bam |
#If you have a local copy: ls -l diff_out -rwxr-x--- 1 daras G-801020 2691192 Aug 21 12:20 isoform_exp.diff : Differential expression testing for transcripts -rwxr-x--- 1 daras G-801020 1483520 Aug 21 12:20 gene_exp.diff : Differential expression testing for genes -rwxr-x--- 1 daras G-801020 1729831 Aug 21 12:20 tss_group_exp.diff: Differential expression testing for primary transcripts -rwxr-x--- 1 daras G-801020 1369451 Aug 21 12:20 cds_exp.diff : Differential expression testing for coding sequences -rwxr-x--- 1 daras G-801020 3277177 Aug 21 12:20 isoforms.fpkm_tracking -rwxr-x--- 1 daras G-801020 1628659 Aug 21 12:20 genes.fpkm_tracking -rwxr-x--- 1 daras G-801020 1885773 Aug 21 12:20 tss_groups.fpkm_tracking -rwxr-x--- 1 daras G-801020 1477492 Aug 21 12:20 cds.fpkm_tracking -rwxr-x--- 1 daras G-801020 1349574 Aug 21 12:20 splicing.diff : Differential splicing tests -rwxr-x--- 1 daras G-801020 1158560 Aug 21 12:20 promoters.diff : Differential promoter usage -rwxr-x--- 1 daras G-801020 919690 Aug 21 12:20 cds.diff : Differential coding output |
PARSING CUFFDIFF OUTPUT
Exercise 3: What is the status column in our gene_exp.diff files? and what values does it have?
cat diff_out/gene_exp.diff|cut -f 7|sort|uniq |
Here is a basic command useful for parsing/sorting the gene_exp.diff or isoform_exp.diff files:
cat diff_out/isoform_exp.diff | awk '{print $10 "\t" $4}' | sort -n -r | head
|
Exercise 3: Find the 10 most up-regulated genes, by fold change that are classified as significantly changed.
cat diff_out/gene_exp.diff |grep 'yes'|sort -k10nr,10|head #Lets pull out just gene names cat diff_out/gene_exp.diff |grep 'yes'|sort -k10nr,10|head|cut -f 3 |
Top 10 upregulated genes scf crc KdelR Nacalpha CG8979 CG4389 Vha68-2 regucalcin CG3835 CG1746 |
Exercise 4: Find the 10 most down-regulated genes, by fold change that are classified as significantly changed.
cat diff_out/gene_exp.diff |grep 'yes'|sort -k10n,10|head|cut -f 3 |
Top 10 downregulated genes CG17065 Git CG18519 del CG13319 RNaseX25 msb1l achi CG15611 CG11309 |
Exercise 5: Find the 10 most up-regulated isoforms, by fold change that are classified as significantly changed. What genes do they belong to?
cat diff_out/isoform_exp.diff |grep 'yes'|sort -k10nr,10|head #To pull out gene names: cat diff_out/isoform_exp.diff |grep 'yes'|sort -k10nr,10|head |
sick mub CG5021 akirin CG4587 c(3)G Ank2 CG1674 CrebB-17A stai |
Exercise 6: Pull out and count the significantly differentially expressed genes do we have (pvalue <=0.05 and abs fold change > 2 or log2 fold change > 1)?
cat diff_out/gene_exp.diff |grep 'yes' |awk '{if (($10 >= 1)||($10<=1)) print }' > DEG
wc -l DEG |