Tuxedo Suite For Splice Variant Analysis and Identifying Novel Transcripts II
All these steps have already been run. We'll be spending time looking at the commands and output. Let's get set up.
Get to the results
cds cd my_rnaseq_course cd day_3_partB/cufflinks_results
Step 1: Tophat
We've already gone over how tophat results look here so let's move on to step 2.
Step 2: Cufflinks
HOW WAS IT RUN?
#If you've copied it over successfully: cat run_commands/commands.cufflinks #If you don't have a local copy, you can read it from the source: cat /corral-repl/utexas/BioITeam/rnaseq_course_2015/day_3_partB/cufflinks_results/run_commands/commands.cufflinks
HOW DOES THE OUTPUT LOOK?
Take a look at output for one of our samples, C1_R1. The important file is transcripts.gtf, which contains Tophat's assembled junctions for C1_R1.
#If you have a local copy: ls -l C1_R1_clout #If you don't have a local copy: ls -l /corral-repl/utexas/BioITeam/rnaseq_course_2015/day_3_partB/cufflinks_results/C1_R1_clout -rw------- 1 daras G-801020 627673 May 17 16:58 genes.fpkm_tracking -rw------- 1 daras G-801020 1021025 May 17 16:58 isoforms.fpkm_tracking -rw------- 1 daras G-801020 0 May 17 16:50 skipped.gtf -rw------- 1 daras G-801020 14784740 May 17 16:58 transcripts.gtf
DESCRIPTION OF TRANSCRIPTS.GTF FILE
Each row corresponds to one transcript or an exon of a transcript. First 7 columns are standard gff/gtf columns, followed by some attributes added by cufflinks.
Column number | Column name | Example | Description |
1 | seqname | chrX | Chromosome or contig name |
2 | source | Cufflinks | The name of the program that generated this file (always 'Cufflinks') |
3 | feature | exon | The type of record (always either "transcript" or "exon".) |
4 | start | 77696957 | The leftmost coordinate of this record. |
5 | end | 77712009 | The rightmost coordinate of this record, inclusive. |
6 | score | 77712009 | Score that tells you how abundant this isoform is compared to other isoforms of the gene |
7 | strand | + | Cufflinks' guess for which strand the isoform came from. Always one of "+", "-", "." |
7 | frame | . | Not used |
8 | attributes | ... | See below. |
- Each GTF record is decorated with the following attributes:
Attribute | Example | Description |
gene_id | CUFF.1 | Cufflinks gene id |
transcript_id | CUFF.1.1 | Cufflinks transcript id |
FPKM | 101.267 | Isoform-level relative abundance in FPKM |
frac | 0.7647 | Reserved. Please ignore. |
cov | 100.765 | Estimate for the absolute depth of read coverage across the whole transcript |
full_read_support | yes | When RABT assembly is used, this attribute reports whether or not all introns and internal exons were fully covered by reads from the data. |
Exercise: Find out how many entries in the transcripts.gtf file are from novel transcripts and how many from annotated transcripts.
The secret lies in the gene_id column. #For counting novel entries grep 'CUFF' C1_R1_clout/transcripts.gtf |wc -l 54936 #For counting entries corresponding to annotated genes grep -v 'CUFF' C1_R1_clout/transcripts.gtf |wc -l 88724 What do you think grep -v does?
Step 3: Cuffmerge
HOW WAS IT RUN?
We first created a file listing the paths of all per-sample transcripts.gtf files so far, then pass that to cuffmerge:
find . -name transcripts.gtf > assembly_list.txt #If you have a local copy: cat assembly_list.txt
cat run_commands/commands.cuffmerge cuffmerge -g reference/genes.exons.gtf assembly_list.txt
HOW DOES THE OUTPUT LOOK?
The most important file is merged.gif, which contains the consensus transcriptome annotations cuffmerge has calculated. This is the input for the next step.
#If you have a local copy: ls -l merged_asm drwx------ 2 daras G-801020 32768 May 28 05:36 logs -rw------- 1 daras G-801020 29449035 May 28 05:36 merged.gtf
The cuffmerge result also has information about how this transcript compares to annotated ones in your gtf file.
CLASS CODES
Priority | Code | Description | |
1 | = | Complete match of intron chain | |
2 | c | Contained | |
3 | j | Potentially novel isoform (fragment): at least one splice junction is shared with a reference transcript | |
4 | e | Single exon transfrag overlapping a reference exon and at least 10 bp of a reference intron, indicating a possible pre-mRNA fragment. | |
5 | i | A transfrag falling entirely within a reference intron | |
6 | o | Generic exonic overlap with a reference transcript | |
7 | p | Possible polymerase run-on fragment (within 2Kbases of a reference transcript) | |
8 | r | Repeat. Currently determined by looking at the soft-masked reference sequence and applied to transcripts where at least 50% of the bases are lower case | |
9 | u | Unknown, intergenic transcript | |
10 | x | Exonic overlap with reference on the opposite strand | |
11 | s | An intron of the transfrag overlaps a reference intron on the opposite strand (likely due to read mapping errors) |
Exercise 1: Count how many potentially novel fragments we have in our data
grep 'class_code "j"' merged_asm/merged.gtf |wc -l 17441
Step 4: Cuffquant
HOW WAS IT RUN?
cat run_commands/commands.cuffquant
HOW DOES THE OUTPUT LOOK?
It makes one file per sample- abuandances.cxb. It's a binary file containing the quantification results for each sample, that are now ready for cuffdiff.
#If you have a local copy: ls -l cuffquant_C1_R1_out/ -rw------- 1 daras G-801020 18373486 May 28 05:24 abundances.cxb
Step 5: Cuffdiff
HOW WAS IT RUN?
#If you have a local copy: cat run_commands/commands.cuffdiff
#If you have a local copy: ls -l diff_out -rwxr-x--- 1 daras G-801020 2691192 Aug 21 12:20 isoform_exp.diff : Differential expression testing for transcripts -rwxr-x--- 1 daras G-801020 1483520 Aug 21 12:20 gene_exp.diff : Differential expression testing for genes -rwxr-x--- 1 daras G-801020 1729831 Aug 21 12:20 tss_group_exp.diff: Differential expression testing for primary transcripts -rwxr-x--- 1 daras G-801020 1369451 Aug 21 12:20 cds_exp.diff : Differential expression testing for coding sequences -rwxr-x--- 1 daras G-801020 3277177 Aug 21 12:20 isoforms.fpkm_tracking -rwxr-x--- 1 daras G-801020 1628659 Aug 21 12:20 genes.fpkm_tracking -rwxr-x--- 1 daras G-801020 1885773 Aug 21 12:20 tss_groups.fpkm_tracking -rwxr-x--- 1 daras G-801020 1477492 Aug 21 12:20 cds.fpkm_tracking -rwxr-x--- 1 daras G-801020 1349574 Aug 21 12:20 splicing.diff : Differential splicing tests -rwxr-x--- 1 daras G-801020 1158560 Aug 21 12:20 promoters.diff : Differential promoter usage -rwxr-x--- 1 daras G-801020 919690 Aug 21 12:20 cds.diff : Differential coding output
PARSING CUFFDIFF OUTPUT
Exercise 3: What is the status column in our gene_exp.diff files? and what values does it have?
cat diff_out/gene_exp.diff|cut -f 7|sort|uniq
Here is a basic command useful for parsing/sorting the gene_exp.diff
or isoform_exp.diff
files:
cat diff_out/isoform_exp.diff | awk '{print $10 "\t" $4}' | sort -n -r | head
Exercise 3: Find the 10 most up-regulated genes, by fold change that are classified as significantly changed.
cat diff_out/gene_exp.diff |grep 'yes'|sort -n -r -k10,10|head #Lets pull out just gene names cat diff_out/gene_exp.diff |grep 'yes'|sort -n -r -k10,10|head|cut -f 3
Exercise 4: Find the 10 most down-regulated genes, by fold change that are classified as significantly changed.
cat diff_out/gene_exp.diff |grep 'yes'|sort -k10n,10|head|cut -f 3
Exercise 5: Find the 10 most up-regulated isoforms, by fold change that are classified as significantly changed. What genes do they belong to?
cat diff_out/isoform_exp.diff |grep 'yes'|sort -k10nr,10|head #To pull out gene names: cat diff_out/isoform_exp.diff |grep 'yes'|sort -k10nr,10|head |cut -f 3
Exercise 6: Pull out and count the significantly differentially expressed genes do we have (qvalue <=0.05 and abs fold change > 2 or log2 fold change > 1)?
cat diff_out/gene_exp.diff |grep 'yes' |awk '{if (($10 >= 1)||($10<=1)) print }' > DEG wc -l DEG
Welcome to the University Wiki Service! Please use your IID (yourEID@eid.utexas.edu) when prompted for your email address during login or click here to enter your EID. If you are experiencing any issues loading content on pages, please try these steps to clear your browser cache.