Content Comparison

...

Code Block

title	Get set up

 
cds
cd my_rnaseq_course
cp -r /corral-repl/utexas/BioITeam/rnaseq_coursecd day_3_partB/cufflinks_results .
 
 
cd cufflinks_results

Step 1: Tophat

We've already gone over how tophat results look here so let's move on to step 2.

...

Code Block

title	commands.cufflinks

#If you've copied it over successfully:
cat run_commands/commands.cufflinks
 
#If you don't have a local copy, you can read it from the source: 
cat /corral-repl/utexas/BioITeam/rnaseq_course_2015/day_3_partB/cufflinks_results/run_commands/commands.cufflinks

...

Code Block

title	Cufflinks output files

#If you have a local copy:
ls -l C1_R1_clout
 
 
#If you don't have a local copy:
ls -l /corral-repl/utexas/BioITeam/rnaseq_course_2015/day_3_partB/cufflinks_results/C1_R1_clout

-rw------- 1 daras G-801020   627673 May 17 16:58 genes.fpkm_tracking
-rw------- 1 daras G-801020  1021025 May 17 16:58 isoforms.fpkm_tracking
-rw------- 1 daras G-801020        0 May 17 16:50 skipped.gtf
-rw------- 1 daras G-801020 14784740 May 17 16:58 transcripts.gtf

...

DESCRIPTION OF TRANSCRIPTS.GTF FILE

...

Code Block

title	Parsing transcripts.gtf file

The secret lies in the gene_id column.

#For counting novel entries 
grep 'CUFF' C1_R1_clout/transcripts.gtf |wc -l

5484754936
 
#For counting entries corresponding to annotated genes
grep -v 'CUFF' C1_R1_clout/transcripts.gtf |wc -l

8864488724
 
What do you think grep -v does?

Step 3:

...

Cuffmerge

HOW WAS IT RUN?

We first created a file listing the paths of all per-sample transcripts.gtf files so far, then pass that to cuffmerge:

Code Block

title	How did we do that?

find . -name transcripts.gtf > assembly_list.txt
 
#If you have a local copy:
cat assembly_list.txt

Code Block

title	commands.cuffcomparecuffmerge file

cat run_commands/commands.cuffcompare
.cuffmerge

cuffmerge  -g reference/genes.exons.gtf  assembly_list.txt

HOW DOES THE OUTPUT LOOK?

...

The most important file

...

is merged.gif, which contains the consensus transcriptome annotations cuffmerge has calculated. This is the input for the next step.

Code Block

title

...

cuffmerge output

#If you have a local copy:
ls -l

...

merged_

...

asm

drwx-

...

--

...

--

...

2 daras G-801020

...

 32768 May 28 05:36 logs
-rw-

...

--

...

 1 daras G-801020

...

29449035 May

...

05:

...

DESCRIPTION OF *TRACKING FILE CLASS CODES

...

The gene_name attribute of the reference GTF record for this transcript,

or '-' if no reference transcript overlaps this Cufflinks transcript

...

The transcript_id attribute of the reference GTF record for this transcript,

or '-' if no reference transcript overlaps this Cufflinks transcript

...

The type of match between the Cufflinks transcripts in column 6 and

...

merged.gtf

The cuffmerge result also has information about how this transcript compares to annotated ones in your gtf file.

CLASS CODES

Priority	Code	Description
1	=	Complete match of intron chain
2	c	Contained
3	j	Potentially novel isoform (fragment): at least one splice junction is shared with a reference transcript
4	e	Single exon transfrag overlapping a reference exon and at least 10 bp of a reference intron, indicating a possible pre-mRNA fragment.
5	i	A transfrag falling entirely within a reference intron
6	o	Generic exonic overlap with a reference transcript
7	p	Possible polymerase run-on fragment (within 2Kbases of a reference transcript)
8	r	Repeat. Currently determined by looking at the soft-masked reference sequence and applied to transcripts where at least 50% of the bases are lower case
9	u	Unknown, intergenic transcript
10	x	Exonic overlap with reference on the opposite strand
11	s	An intron of the transfrag overlaps a reference intron on the opposite strand (likely due to read mapping errors)

Exercise

...

1: Count how many potentially novel fragments we have in our data

Code Block

title	parse cuffcompare output

...

grep

...

'class_code "j"

...

' merged_asm/merged.gtf |wc -l

...

Step 4:

...

Cuffquant

HOW WAS IT RUN?

We first created a file listing the paths of all per-sample transcripts.gtf files so far, then pass that to cuffmerge:

Code Block

title	How did we do that?

find . -name transcripts.gtf > assembly_list.txt
 
#If you have a local copy:
cat assembly_list.txt

Code Block

title	commands.cuffmerge filecuffcompare

cat run_commands/commands.cuffmerge

 
     cuffmerge  -g reference/genes.exons.gtf  assembly_list.txt

cuffquant

HOW DOES THE OUTPUT LOOK?The most important file is merged.gif, which contains the consensus transcriptome annotations cuffmerge has calculated. This is the input for the next step

It makes one file per sample- abuandances.cxb. It's a binary file containing the quantification results for each sample, that are now ready for cuffdiff.

Code Block

title

...

cuffcompare output

#If you have a local copy:
ls -l

...

cuffquant_C1_R1_out/

-rw------- 1 daras G-

...

18373486

...

May

...

05:24 abundances.cxb

Step 5: Cuffdiff

HOW WAS IT RUN?

Code Block

title	commands.cuffdiff file

#If you have a local copy:

 
cuffdiff -o diff_out -b reference/genome.fa -p 8 -L C1,C2 -u merged_asm/merged.gtf C1_R1_thout/accepted_hits.bam,C1_R2_thout/accepted_hits.bam,C1_R3_thout/accepted_hits.bam C2_R1_thout/accepted_hits.bam,C2_R2_thout/accepted_hits.bam,C2_R3_thout/accepted_hits.bamcat run_commands/commands.cuffdiff

HOW DOES THE OUTPUT LOOK?

Code Block

title	cuffdiff output

#If you have a local copy:
ls -l diff_out

-rwxr-x--- 1 daras G-801020  2691192 Aug 21 12:20 isoform_exp.diff  : Differential expression testing for transcripts
-rwxr-x--- 1 daras G-801020  1483520 Aug 21 12:20 gene_exp.diff     : Differential expression testing for genes
-rwxr-x--- 1 daras G-801020  1729831 Aug 21 12:20 tss_group_exp.diff: Differential expression testing for primary transcripts
-rwxr-x--- 1 daras G-801020  1369451 Aug 21 12:20 cds_exp.diff      : Differential expression testing for coding sequences
 
-rwxr-x--- 1 daras G-801020  3277177 Aug 21 12:20 isoforms.fpkm_tracking
-rwxr-x--- 1 daras G-801020  1628659 Aug 21 12:20 genes.fpkm_tracking
-rwxr-x--- 1 daras G-801020  1885773 Aug 21 12:20 tss_groups.fpkm_tracking
-rwxr-x--- 1 daras G-801020  1477492 Aug 21 12:20 cds.fpkm_tracking
 
-rwxr-x--- 1 daras G-801020  1349574 Aug 21 12:20 splicing.diff  : Differential splicing tests
-rwxr-x--- 1 daras G-801020  1158560 Aug 21 12:20 promoters.diff : Differential promoter usage
-rwxr-x--- 1 daras G-801020   919690 Aug 21 12:20 cds.diff       : Differential coding output

PARSING CUFFDIFF OUTPUT

Exercise 3: What is the status column in our gene_exp.diff files? and what values does it have?

Code Block

title	status

cat diff_out/gene_exp.diff|cut -f 7|sort|uniq

Here is a basic command useful for parsing/sorting the gene_exp.diff or isoform_exp.diff files:

Code Block

title	Linux one-liner for sorting cuffdiff output by log2 fold-change values

cat diff_out/isoform_exp.diff | awk '{print $10 "\t" $4}' | sort -n -r | head

Exercise 3: Find the 10 most up-regulated genes, by fold change that are classified as significantly changed.

Code Block

title	One-line command to get 10 most up regulated genes

cat diff_out/gene_exp.diff |grep 'yes'|sort -k10nrn -r -k10,10|head

 
#Lets pull out just gene names
cat diff_out/gene_exp.diff |grep 'yes'|sort -n -r -k10nrk10,10|head|cut -f 3

Expand

	Solution
	Solution

Top 10 upregulated genes

scf

crc

Nacalpha

Df31

KdelR

Nacalpha

CG8979

CG4389

Vha68-2regucalcin

CG3835

CG1746regucalcin

Exercise 4: Find the 10 most down-regulated genes, by fold change that are classified as significantly changed.

Code Block

title	One-line command to get 10 most down regulated genes

cat diff_out/gene_exp.diff |grep 'yes'|sort -k10n,10|head|cut -f 3

Expand

	Solution
	Solution

Top 10 downregulated genes

CG17065HdacX

GitNep2

CG18519RpS19bdel

Amy-d

CG13319CG6847

RNaseX25Aplip1

msb1lqua

achiCG17065

CG15611CG31975

CG11309Git

Exercise 5: Find the 10 most up-regulated isoforms, by fold change that are classified as significantly changed. What genes do they belong to?

Code Block

title	One-line command to get 10 most up-regulated isoforms

cat diff_out/isoform_exp.diff |grep 'yes'|sort -k10nr,10|head

 
#To pull out gene names:
cat diff_out/isoform_exp.diff |grep 'yes'|sort -k10nr,10|head |cut -f 3

Expand

	Solution
	Solution

sickCG30460

mubHmgs

CG5021Lerp

akirinkst

CG4587

c(3)G

Ank2

CG1674

CrebB-17A

staiena

crol

Cnx99A

CG42345

Nipsnap

CG15814

Exercise 6: Pull out and count the significantly differentially expressed genes do we have (pvalue qvalue <=0.05 and abs fold change > 2 or log2 fold change > 1)?

Code Block

title	DEG list

cat diff_out/gene_exp.diff |grep 'yes' |awk '{if (($10 >= 1)||($10<=1)) print }' > DEG

wc -l DEG

Version	Old Version 1	New Version Current
Changes made by	Dhivya Arasappan	Dhivya Arasappan
Saved on	May 25, 2015	May 25, 2016

Versions Compared

Key

Step 1: Tophat

Step 3:

Cuffmerge

Step 4:

Cuffquant

Step 5: Cuffdiff