View Source

All these steps have already been run. We'll be spending time looking at the commands and output. Let's get set up.

Get to the results

cds
cd my_rnaseq_course
cp -r /corral-repl/utexas/BioITeam/rnaseq_course/cufflinks_results .
 
 
cd cufflinks_results

Step 1: Tophat

We've already gone over how tophat results look here so let's move on to step 2.

Step 2: Cufflinks

HOW WAS IT RUN?

#If you've copied it over successfully:
cat run_commands/commands.cufflinks
 
#If you don't have a local copy, you can read it from the source: 
cat /corral-repl/utexas/BioITeam/rnaseq_course/cufflinks_results/run_commands/commands.cufflinks

HOW DOES THE OUTPUT LOOK?

Take a look at output for one of our samples, C1_R1. The important file is transcripts.gtf, which contains Tophat's assembled junctions for C1_R1.

#If you have a local copy:
ls -l C1_R1_clout
 
 
#If you don't have a local copy:
ls -l /corral-repl/utexas/BioITeam/rnaseq_course/cufflinks_results/C1_R1_clout

-rw------- 1 daras G-801020   627673 May 17 16:58 genes.fpkm_tracking
-rw------- 1 daras G-801020  1021025 May 17 16:58 isoforms.fpkm_tracking
-rw------- 1 daras G-801020        0 May 17 16:50 skipped.gtf
-rw------- 1 daras G-801020 14784740 May 17 16:58 transcripts.gtf

DESCRIPTION OF TRANSCRIPTS.GTF FILE

Each row corresponds to one transcript or an exon of a transcript. First 7 columns are standard gff/gtf columns, followed by some attributes added by cufflinks.

Column number	Column name	Example	Description
1	seqname	chrX	Chromosome or contig name
2	source	Cufflinks	The name of the program that generated this file (always 'Cufflinks')
3	feature	exon	The type of record (always either "transcript" or "exon".)
4	start	77696957	The leftmost coordinate of this record.
5	end	77712009	The rightmost coordinate of this record, inclusive.
6	score	77712009	Score that tells you how abundant this isoform is compared to other isoforms of the gene
7	strand	+	Cufflinks' guess for which strand the isoform came from. Always one of "+", "-", "."
7	frame	.	Not used
8	attributes	...	See below.

Each GTF record is decorated with the following attributes:

Attribute	Example	Description
gene_id	CUFF.1	Cufflinks gene id
transcript_id	CUFF.1.1	Cufflinks transcript id
FPKM	101.267	Isoform-level relative abundance in FPKM
frac	0.7647	Reserved. Please ignore.

cov	100.765	Estimate for the absolute depth of read coverage across the whole transcript
full_read_support	yes	When RABT assembly is used, this attribute reports whether or not all introns and internal exons were fully covered by reads from the data.

Exercise: Find out how many entries in the transcripts.gtf file are from novel transcripts and how many from annotated transcripts.

The secret lies in the gene_id column.

#For counting novel entries 
grep 'CUFF' C1_R1_clout/transcripts.gtf |wc -l

54847
 
#For counting entries corresponding to annotated genes
grep -v 'CUFF' C1_R1_clout/transcripts.gtf |wc -l

88644
 
What do you think grep -v does?

Step 3: Cuffcompare

HOW WAS IT RUN?

cat run_commands/commands.cuffcompare

HOW DOES THE OUTPUT LOOK?

cuffcmp.tracking is the most important file to pay attention to. It tells you how the assembled transcript are related to known transcripts (from genes.gtf file)

#If you have a local copy:
ls -l cmp_out

-rw-rw-r-- 1 daras G-801020 26041012 May 21 01:34 cuffcmp.combined.gtf
-rw-rw-r-- 1 daras G-801020  2544325 May 21 01:34 cuffcmp.loci
-rw-rw-r-- 1 daras G-801020     6234 May 21 01:34 cuffcmp.stats
-rw-rw-r-- 1 daras G-801020 10816986 May 21 01:34 cuffcmp.tracking

DESCRIPTION OF *TRACKING FILE CLASS CODES

Column name	Example	Description
1	Cufflinks transfrag id	TCONS_00000045	A unique internal id for the transfrag
2	Cufflinks locus id	XLOC_000023	A unique internal id for the locus
3	Reference gene id	Tcea	The gene_name attribute of the reference GTF record for this transcript, or '-' if no reference transcript overlaps this Cufflinks transcript
4	Reference transcript id	uc007afj.1	The transcript_id attribute of the reference GTF record for this transcript, or '-' if no reference transcript overlaps this Cufflinks transcript
5	Class code	c	The type of match between the Cufflinks transcripts in column 6 and the reference transcript. See below for class code info

CLASS CODES

Priority	Code	Description
1	=	Complete match of intron chain
2	c	Contained
3	j	Potentially novel isoform (fragment): at least one splice junction is shared with a reference transcript
4	e	Single exon transfrag overlapping a reference exon and at least 10 bp of a reference intron, indicating a possible pre-mRNA fragment.
5	i	A transfrag falling entirely within a reference intron
6	o	Generic exonic overlap with a reference transcript
7	p	Possible polymerase run-on fragment (within 2Kbases of a reference transcript)
8	r	Repeat. Currently determined by looking at the soft-masked reference sequence and applied to transcripts where at least 50% of the bases are lower case
9	u	Unknown, intergenic transcript
10	x	Exonic overlap with reference on the opposite strand
11	s	An intron of the transfrag overlaps a reference intron on the opposite strand (likely due to read mapping errors)

Exercise 2: Count how many potentially novel fragments we have in our data

cat cmp_out/cuffcmp.tracking |awk '{if ($4 == "j")print}'|wc -l

556
 
Why did we choose awk here instead of grep?

Step 4: Cuffmerge

HOW WAS IT RUN?

We first created a file listing the paths of all per-sample transcripts.gtf files so far, then pass that to cuffmerge:

find . -name transcripts.gtf > assembly_list.txt
 
#If you have a local copy:
cat assembly_list.txt

cat run_commands/commands.cuffmerge

 
     cuffmerge  -g reference/genes.exons.gtf  assembly_list.txt

HOW DOES THE OUTPUT LOOK?

The most important file is merged.gif, which contains the consensus transcriptome annotations cuffmerge has calculated. This is the input for the next step.

#If you have a local copy:
ls -l merged_asm

-rwxrwxr-x  1 daras G-803889  1571816 Aug 16  2012 genes.fpkm_tracking
-rwxrwxr-x  1 daras G-803889  2281319 Aug 16  2012 isoforms.fpkm_tracking
drwxrwxr-x  2 daras G-803889    32768 Aug 16  2012 logs
-r-xrwxr-x  1 daras G-803889 32090408 Aug 16  2012 merged.gtf
-rwxrwxr-x  1 daras G-803889        0 Aug 16  2012 skipped.gtf
drwxrwxr-x  2 daras G-803889    32768 Aug 16  2012 tmp
-rwxrwxr-x  1 daras G-803889 34844830 Aug 16  2012 transcripts.gtf

Step 5: Cuffdiff

HOW WAS IT RUN?

#If you have a local copy:

 
cuffdiff -o diff_out -b reference/genome.fa -p 8 -L C1,C2 -u merged_asm/merged.gtf C1_R1_thout/accepted_hits.bam,C1_R2_thout/accepted_hits.bam,C1_R3_thout/accepted_hits.bam C2_R1_thout/accepted_hits.bam,C2_R2_thout/accepted_hits.bam,C2_R3_thout/accepted_hits.bam

HOW DOES THE OUTPUT LOOK?

#If you have a local copy:
ls -l diff_out

-rwxr-x--- 1 daras G-801020  2691192 Aug 21 12:20 isoform_exp.diff  : Differential expression testing for transcripts
-rwxr-x--- 1 daras G-801020  1483520 Aug 21 12:20 gene_exp.diff     : Differential expression testing for genes
-rwxr-x--- 1 daras G-801020  1729831 Aug 21 12:20 tss_group_exp.diff: Differential expression testing for primary transcripts
-rwxr-x--- 1 daras G-801020  1369451 Aug 21 12:20 cds_exp.diff      : Differential expression testing for coding sequences
 
-rwxr-x--- 1 daras G-801020  3277177 Aug 21 12:20 isoforms.fpkm_tracking
-rwxr-x--- 1 daras G-801020  1628659 Aug 21 12:20 genes.fpkm_tracking
-rwxr-x--- 1 daras G-801020  1885773 Aug 21 12:20 tss_groups.fpkm_tracking
-rwxr-x--- 1 daras G-801020  1477492 Aug 21 12:20 cds.fpkm_tracking
 
-rwxr-x--- 1 daras G-801020  1349574 Aug 21 12:20 splicing.diff  : Differential splicing tests
-rwxr-x--- 1 daras G-801020  1158560 Aug 21 12:20 promoters.diff : Differential promoter usage
-rwxr-x--- 1 daras G-801020   919690 Aug 21 12:20 cds.diff       : Differential coding output

PARSING CUFFDIFF OUTPUT

Exercise 3: What is the status column in our gene_exp.diff files? and what values does it have?

cat diff_out/gene_exp.diff|cut -f 7|sort|uniq

Here is a basic command useful for parsing/sorting the gene_exp.diff or isoform_exp.diff files:

cat diff_out/isoform_exp.diff | awk '{print $10 "\t" $4}' | sort -n -r | head

Exercise 3: Find the 10 most up-regulated genes, by fold change that are classified as significantly changed.

cat diff_out/gene_exp.diff |grep 'yes'|sort -k10nr,10|head

 
#Lets pull out just gene names
cat diff_out/gene_exp.diff |grep 'yes'|sort -k10nr,10|head|cut -f 3

Top 10 upregulated genes

scf

crc

KdelR

Nacalpha

CG8979

CG4389

Vha68-2

regucalcin

CG3835

CG1746

Exercise 4: Find the 10 most down-regulated genes, by fold change that are classified as significantly changed.

cat diff_out/gene_exp.diff |grep 'yes'|sort -k10n,10|head|cut -f 3

Top 10 downregulated genes

CG17065

Git

CG18519

del

CG13319

RNaseX25

msb1l

achi

CG15611

CG11309

Exercise 5: Find the 10 most up-regulated isoforms, by fold change that are classified as significantly changed. What genes do they belong to?

cat diff_out/isoform_exp.diff |grep 'yes'|sort -k10nr,10|head

 
#To pull out gene names:
cat diff_out/isoform_exp.diff |grep 'yes'|sort -k10nr,10|head

sick

mub

CG5021

akirin

CG4587

c(3)G

Ank2

CG1674

CrebB-17A

stai

Exercise 6: Pull out and count the significantly differentially expressed genes do we have (pvalue <=0.05 and abs fold change > 2 or log2 fold change > 1)?

cat diff_out/gene_exp.diff |grep 'yes' |awk '{if (($10 >= 1)||($10<=1)) print }' > DEG

wc -l DEG