...
Step 2: Cufflinks
HOW WAS IT RUN?
| Expand | |||
|---|---|---|---|
| |||
#If you've copied it over successfully: cat run_commands/commands.cufflinks #If you don't have a local copy, you can read it from the source: cat /corral-repl/utexas/BioITeam/rnaseq_course/cufflinks_results/commands.cufflinks |
...
| Code Block | ||
|---|---|---|
| ||
#If you have a local copy:
ls -l C1_R1_clout
#If you don't have a local copy:
ls -l /corral-repl/utexas/BioITeam/rnaseq_course/cufflinks_results/C1_R1_clout
-rw------- 1 daras G-801020 627673 May 17 16:58 genes.fpkm_tracking
-rw------- 1 daras G-801020 1021025 May 17 16:58 isoforms.fpkm_tracking
-rw------- 1 daras G-801020 0 May 17 16:50 skipped.gtf
-rw------- 1 daras G-801020 14784740 May 17 16:58 transcripts.gtf
|
DESCRIPTION OF TRANSCRIPTS.GTF FILE
Each row corresponds to one transcript or an exon of a transcript. First 7 columns are standard gff/gtf columns, followed by some attributes added by cufflinks.
| Column number | Column name | Example | Description |
| 1 | seqname | chrX | Chromosome or contig name |
| 2 | source | Cufflinks | The name of the program that generated this file (always 'Cufflinks') |
| 3 | feature | exon | The type of record (always either "transcript" or "exon".) |
| 4 | start | 77696957 | The leftmost coordinate of this record. |
| 5 | end | 77712009 | The rightmost coordinate of this record, inclusive. |
| 6 | score | 77712009 | Score that tells you how abundant this isoform is compared to other isoforms of the gene |
| 7 | strand | + | Cufflinks' guess for which strand the isoform came from. Always one of "+", "-", "." |
| 7 | frame | . | Not used |
| 8 | attributes | ... | See below. |
- Each GTF record is decorated with the following attributes:
| Attribute | Example | Description |
| gene_id | CUFF.1 | Cufflinks gene id |
| transcript_id | CUFF.1.1 | Cufflinks transcript id |
| FPKM | 101.267 | Isoform-level relative abundance in FPKM |
| frac | 0.7647 | Reserved. Please ignore. |
| conf_lo | 0.07 | Lower bound of the 95% confidence interval of the abundance of this isoform. |
| conf_hi | 0.1102 | Upper bound of the 95% confidence interval of the abundance of this isoform. |
| cov | 100.765 | Estimate for the absolute depth of read coverage across the whole transcript |
| full_read_support | yes | When RABT assembly is used, this attribute reports whether or not all introns and internal exons were fully covered by reads from the data. |
Exercise: Find out how many entries in the transcripts.gtf file are from novel transcripts and how many from annotated transcripts.
| Code Block | ||
|---|---|---|
| ||
The secret lies in the gene_id column. #For counting novel entries grep 'CUFF' transcripts.gtf |wc -l 54847 #For counting entries corresponding to annotated genes grep -v 'CUFF' transcripts.gtf |wc -l 88644 What do you think grep -v does? |
Step 3: Cuffcompare
HOW WAS IT RUN?
| Code Block | ||
|---|---|---|
| ||
cat run_commands/commands.cuffcompare
|
HOW DOES THE OUTPUT LOOK?
cuffcmp.tracking is the most important file to pay attention to. It tells you how the assembled transcript are related to known transcripts (from genes.gtf file)
| Code Block | ||
|---|---|---|
| ||
#If you have a local copy:
ls -l cmp_out
-rw-rw-r-- 1 daras G-801020 26041012 May 21 01:34 cuffcmp.combined.gtf
-rw-rw-r-- 1 daras G-801020 2544325 May 21 01:34 cuffcmp.loci
-rw-rw-r-- 1 daras G-801020 6234 May 21 01:34 cuffcmp.stats
-rw-rw-r-- 1 daras G-801020 10816986 May 21 01:34 cuffcmp.tracking
|
Step 4: Cuffmerge
HOW WAS IT RUN?
We first created a file listing the paths of all per-sample transcripts.gtf files so far, then pass that to cuffmerge:
| Code Block | ||
|---|---|---|
| ||
find . -name transcripts.gtf > assembly_list.txt #If you have a local copy: cat assembly_list.txt #If you don't have a local copy: cat /corral-repl/utexas/BioITeam/rnaseq_course/cufflinks_results/assembly_list.txt |
| Code Block | ||
|---|---|---|
| ||
cat run_commands/commands.cuffmerge
cuffmerge -g reference/genes.exons.gtf assembly_list.txt |
...
| Code Block | ||
|---|---|---|
| ||
#If you have a local copy: ls -l merged_asm #If you don't have a local copy: cat /corral-repl/utexas/BioITeam/rnaseq_course/cufflinks_results/merged_asm -rwxrwxr-x 1 daras G-803889 1571816 Aug 16 2012 genes.fpkm_tracking -rwxrwxr-x 1 daras G-803889 2281319 Aug 16 2012 isoforms.fpkm_tracking drwxrwxr-x 2 daras G-803889 32768 Aug 16 2012 logs -r-xrwxr-x 1 daras G-803889 32090408 Aug 16 2012 merged.gtf -rwxrwxr-x 1 daras G-803889 0 Aug 16 2012 skipped.gtf drwxrwxr-x 2 daras G-803889 32768 Aug 16 2012 tmp -rwxrwxr-x 1 daras G-803889 34844830 Aug 16 2012 transcripts.gtf |
Step
...
5: Cuffdiff
HOW WAS IT RUN?
| Code Block | ||
|---|---|---|
| ||
#If you have a local copy:
cat run_commands/commands.cuffdiff
cuffdiff -o diff_out -b reference/genome.fa -p 8 -L C1,C2 -u merged_asm/merged.gtf C1_R1_thout/accepted_hits.bam,C1_R2_thout/accepted_hits.bam,C1_R3_thout/accepted_hits.bam C2_R1_thout/accepted_hits.bam,C2_R2_thout/accepted_hits.bam,C2_R3_thout/accepted_hits.bam
#If you don't have a local copy:
cat /corral-repl/utexas/BioITeam/rnaseq_course/cufflinks_results/commands.cuffdiff |
| Code Block | ||
|---|---|---|
| ||
#If you have a local copy: ls -l diff_out #If you don't have a local copy: cat /corral-repl/utexas/BioITeam/rnaseq_course/cufflinks_results/diff_out -rwxr-x--- 1 daras G-801020 2691192 Aug 21 12:20 isoform_exp.diff : Differential expression testing for transcripts -rwxr-x--- 1 daras G-801020 1483520 Aug 21 12:20 gene_exp.diff : Differential expression testing for genes -rwxr-x--- 1 daras G-801020 1729831 Aug 21 12:20 tss_group_exp.diff: Differential expression testing for primary transcripts -rwxr-x--- 1 daras G-801020 1369451 Aug 21 12:20 cds_exp.diff : Differential expression testing for coding sequences -rwxr-x--- 1 daras G-801020 3277177 Aug 21 12:20 isoforms.fpkm_tracking -rwxr-x--- 1 daras G-801020 1628659 Aug 21 12:20 genes.fpkm_tracking -rwxr-x--- 1 daras G-801020 1885773 Aug 21 12:20 tss_groups.fpkm_tracking -rwxr-x--- 1 daras G-801020 1477492 Aug 21 12:20 cds.fpkm_tracking -rwxr-x--- 1 daras G-801020 1349574 Aug 21 12:20 splicing.diff : Differential splicing tests -rwxr-x--- 1 daras G-801020 1158560 Aug 21 12:20 promoters.diff : Differential promoter usage -rwxr-x--- 1 daras G-801020 919690 Aug 21 12:20 cds.diff : Differential coding output |
PARSING CUFFDIFF OUTPUT
Here is a basic command useful for parsing/sorting the gene_exp.diff or isoform_exp.diff files:
| Code Block | ||
|---|---|---|
| ||
cat diff_out/isoform_exp.diff | awk '{print $10 "\t" $4}' | sort -n -r | head
|
Exercise 1: Find the 10 most up-regulated genes, by fold change that are classified as significantly changed.
| Code Block | ||
|---|---|---|
| ||
cat diff_out/gene_exp.diff |grep 'yes'|sort -k10nr,10|head #Lets pull out just gene names cat diff_out/gene_exp.diff |grep 'yes'|sort -k10nr,10|head|cut -f 3 |
| Expand | ||||
|---|---|---|---|---|
| ||||
Top 10 upregulated genes scf crc KdelR Nacalpha CG8979 CG4389 Vha68-2 regucalcin CG3835 CG1746 |
Exercise 2: Find the 10 most down-regulated genes, by fold change that are classified as significantly changed.
| Code Block | ||
|---|---|---|
| ||
cat diff_out/gene_exp.diff |grep 'yes'|sort -k10n,10|head|cut -f 3 |
| Expand | ||||
|---|---|---|---|---|
| ||||
Top 10 downregulated genes CG17065 Git CG18519 del CG13319 RNaseX25 msb1l achi CG15611 CG11309 |
Exercise 3: Find the 10 most up-regulated isoforms, by fold change that are classified as significantly changed. What genes do they belong to?
| Code Block | ||
|---|---|---|
| ||
cat diff_out/isoform_exp.diff |grep 'yes'|sort -k10nr,10|head #To pull out gene names: cat diff_out/isoform_exp.diff |grep 'yes'|sort -k10nr,10|head |
| Expand | ||||
|---|---|---|---|---|
| ||||
sick mub CG5021 akirin CG4587 c(3)G Ank2 CG1674 CrebB-17A stai |