Tuxedo Suite For Splice Variant Analysis and Identifying Novel Transcripts II

All these steps have already been run. We'll be spending time looking at the commands and output. Let's get set up.

Get to the results

Get set up
 
cds
cd my_rnaseq_course
cd day_3_partB/cufflinks_results

Step 1: Tophat

We've already gone over how tophat results look here so let's move on to step 2. 

HOW WAS IT RUN?

 

commands.cufflinks
#If you've copied it over successfully:
cat run_commands/commands.cufflinks
 
#If you don't have a local copy, you can read it from the source: 
cat /corral-repl/utexas/BioITeam/rnaseq_course_2015/day_3_partB/cufflinks_results/run_commands/commands.cufflinks

 

HOW DOES THE OUTPUT LOOK?

 Take a look at output for one of our samples, C1_R1. The important file is transcripts.gtf, which contains Tophat's assembled junctions for C1_R1.

Cufflinks output files
#If you have a local copy:
ls -l C1_R1_clout
 
 
#If you don't have a local copy:
ls -l /corral-repl/utexas/BioITeam/rnaseq_course_2015/day_3_partB/cufflinks_results/C1_R1_clout

-rw------- 1 daras G-801020   627673 May 17 16:58 genes.fpkm_tracking
-rw------- 1 daras G-801020  1021025 May 17 16:58 isoforms.fpkm_tracking
-rw------- 1 daras G-801020        0 May 17 16:50 skipped.gtf
-rw------- 1 daras G-801020 14784740 May 17 16:58 transcripts.gtf

DESCRIPTION OF TRANSCRIPTS.GTF FILE

Each row corresponds to one transcript or an exon of a transcript. First 7 columns are standard gff/gtf columns, followed by some attributes added by cufflinks.

Column numberColumn nameExampleDescription
1seqnamechrXChromosome or contig name
2sourceCufflinksThe name of the program that generated this file (always 'Cufflinks')
3featureexonThe type of record (always either "transcript" or "exon".)
4start77696957The leftmost coordinate of this record.
5end77712009The rightmost coordinate of this record, inclusive.
6score77712009Score that tells you how abundant this isoform is compared to other isoforms of the gene
7strand+Cufflinks' guess for which strand the isoform came from. Always one of "+", "-", "."
7frame.Not used
8attributes...See below.

 

    Each GTF record is decorated with the following attributes:
AttributeExampleDescription
gene_idCUFF.1Cufflinks gene id
transcript_idCUFF.1.1Cufflinks transcript id
FPKM101.267Isoform-level relative abundance in FPKM
frac0.7647Reserved. Please ignore.
   
cov100.765Estimate for the absolute depth of read coverage across the whole transcript
full_read_supportyes

When RABT assembly is used, this attribute reports whether or not all introns and internal exons were fully covered by reads from the data.

Exercise: Find out how many entries in the transcripts.gtf file are from novel transcripts and how many from annotated transcripts.

Parsing transcripts.gtf file
The secret lies in the gene_id column.

#For counting novel entries 
grep 'CUFF' C1_R1_clout/transcripts.gtf |wc -l

54936
 
#For counting entries corresponding to annotated genes
grep -v 'CUFF' C1_R1_clout/transcripts.gtf |wc -l

88724
 
What do you think grep -v does?

Step 3: Cuffmerge

HOW WAS IT RUN?

We first created a file listing the paths of all per-sample transcripts.gtf files so far, then pass that to cuffmerge:

How did we do that?
find . -name transcripts.gtf > assembly_list.txt
 
#If you have a local copy:
cat assembly_list.txt

commands.cuffmerge file
cat run_commands/commands.cuffmerge

cuffmerge  -g reference/genes.exons.gtf  assembly_list.txt

 

HOW DOES THE OUTPUT LOOK?

The most important file is merged.gif, which contains the consensus transcriptome annotations cuffmerge has calculated. This is the input for the next step.

cuffmerge output
#If you have a local copy:
ls -l merged_asm

drwx------ 2 daras G-801020    32768 May 28 05:36 logs
-rw------- 1 daras G-801020 29449035 May 28 05:36 merged.gtf

The cuffmerge result also has information about how this transcript compares to annotated ones in your gtf file.

CLASS CODES

PriorityCodeDescription
1=Complete match of intron chain
2cContained 
3jPotentially novel isoform (fragment): at least one splice junction is shared with a reference transcript 
4eSingle exon transfrag overlapping a reference exon and at least 10 bp of a reference intron, indicating a possible pre-mRNA fragment. 
5iA transfrag falling entirely within a reference intron 
6oGeneric exonic overlap with a reference transcript 
7pPossible polymerase run-on fragment (within 2Kbases of a reference transcript) 
8rRepeat. Currently determined by looking at the soft-masked reference sequence and applied to transcripts where at least 50% of the bases are lower case 
9uUnknown, intergenic transcript 
10xExonic overlap with reference on the opposite strand 
11sAn intron of the transfrag overlaps a reference intron on the opposite strand (likely due to read mapping errors)


Exercise 1: Count how many potentially novel fragments we have in our data

parse cuffcompare output
grep 'class_code "j"' merged_asm/merged.gtf |wc -l

17441

Step 4: Cuffquant

HOW WAS IT RUN?

commands.cuffcompare
cat run_commands/commands.cuffquant


HOW DOES THE OUTPUT LOOK?

It makes one file per sample- abuandances.cxb. It's a binary file containing the quantification results for each sample, that are now ready for cuffdiff.

cuffcompare output
#If you have a local copy:
ls -l cuffquant_C1_R1_out/

-rw------- 1 daras G-801020 18373486 May 28 05:24 abundances.cxb

 

Step 5: Cuffdiff

HOW WAS IT RUN?

 

commands.cuffdiff file
#If you have a local copy:
cat run_commands/commands.cuffdiff 


HOW DOES THE OUTPUT LOOK?
cuffdiff output
#If you have a local copy:
ls -l diff_out

-rwxr-x--- 1 daras G-801020  2691192 Aug 21 12:20 isoform_exp.diff  : Differential expression testing for transcripts
-rwxr-x--- 1 daras G-801020  1483520 Aug 21 12:20 gene_exp.diff     : Differential expression testing for genes
-rwxr-x--- 1 daras G-801020  1729831 Aug 21 12:20 tss_group_exp.diff: Differential expression testing for primary transcripts
-rwxr-x--- 1 daras G-801020  1369451 Aug 21 12:20 cds_exp.diff      : Differential expression testing for coding sequences
 
-rwxr-x--- 1 daras G-801020  3277177 Aug 21 12:20 isoforms.fpkm_tracking
-rwxr-x--- 1 daras G-801020  1628659 Aug 21 12:20 genes.fpkm_tracking
-rwxr-x--- 1 daras G-801020  1885773 Aug 21 12:20 tss_groups.fpkm_tracking
-rwxr-x--- 1 daras G-801020  1477492 Aug 21 12:20 cds.fpkm_tracking
 
-rwxr-x--- 1 daras G-801020  1349574 Aug 21 12:20 splicing.diff  : Differential splicing tests
-rwxr-x--- 1 daras G-801020  1158560 Aug 21 12:20 promoters.diff : Differential promoter usage
-rwxr-x--- 1 daras G-801020   919690 Aug 21 12:20 cds.diff       : Differential coding output


PARSING CUFFDIFF OUTPUT

Exercise 3: What is the status column in our gene_exp.diff files? and what values does it have?

status
cat diff_out/gene_exp.diff|cut -f 7|sort|uniq

Here is a basic command useful for parsing/sorting the gene_exp.diff or isoform_exp.diff files:

Linux one-liner for sorting cuffdiff output by log2 fold-change values
cat diff_out/isoform_exp.diff | awk '{print $10 "\t" $4}' | sort -n -r | head

Exercise 3: Find the 10 most up-regulated genes, by fold change that are classified as significantly changed. 

One-line command to get 10 most up regulated genes
cat diff_out/gene_exp.diff |grep 'yes'|sort -n -r -k10,10|head

 
#Lets pull out just gene names
cat diff_out/gene_exp.diff |grep 'yes'|sort -n -r -k10,10|head|cut -f 3
 Solution

Top 10 upregulated genes

scf

crc

Nacalpha

Df31

KdelR

CG8979

CG4389

Vha68-2

CG3835

regucalcin


Exercise 4: Find the 10 most down-regulated genes, by fold change that are classified as significantly changed. 

One-line command to get 10 most down regulated genes
cat diff_out/gene_exp.diff |grep 'yes'|sort -k10n,10|head|cut -f 3
 Solution

Top 10 downregulated genes

HdacX

Nep2

RpS19b

Amy-d

CG6847

Aplip1

qua

CG17065

CG31975

Git

 

Exercise 5: Find the 10 most up-regulated isoforms, by fold change that are classified as significantly changed. What genes do they belong to?

One-line command to get 10 most up-regulated isoforms
cat diff_out/isoform_exp.diff |grep 'yes'|sort -k10nr,10|head

 
#To pull out gene names:
cat diff_out/isoform_exp.diff |grep 'yes'|sort -k10nr,10|head |cut -f 3
 Solution

CG30460

Hmgs

Lerp

kst

ena

crol

Cnx99A

CG42345

Nipsnap

CG15814

Exercise 6: Pull out and count the significantly differentially expressed genes do we have (qvalue <=0.05 and abs fold change > 2 or log2 fold change > 1)?

DEG list
cat diff_out/gene_exp.diff |grep 'yes' |awk '{if (($10 >= 1)||($10<=1)) print }' > DEG

wc -l DEG