...
| Code Block | ||
|---|---|---|
| ||
cds cd my_rnaseq_course cp -r /corral-repl/utexas/BioITeam/rnaseq_coursecd day_3_partB/cufflinks_results . cd cufflinks_results |
Step 1: Tophat
We've already gone over how tophat results look here so let's move on to step 2.
...
| Code Block | ||
|---|---|---|
| ||
#If you've copied it over successfully: cat run_commands/commands.cufflinks #If you don't have a local copy, you can read it from the source: cat /corral-repl/utexas/BioITeam/rnaseq_course_2015/day_3_partB/cufflinks_results/run_commands/commands.cufflinks |
...
| Code Block | ||
|---|---|---|
| ||
#If you have a local copy: ls -l C1_R1_clout #If you don't have a local copy: ls -l /corral-repl/utexas/BioITeam/rnaseq_course_2015/day_3_partB/cufflinks_results/C1_R1_clout -rw------- 1 daras G-801020 627673 May 17 16:58 genes.fpkm_tracking -rw------- 1 daras G-801020 1021025 May 17 16:58 isoforms.fpkm_tracking -rw------- 1 daras G-801020 0 May 17 16:50 skipped.gtf -rw------- 1 daras G-801020 14784740 May 17 16:58 transcripts.gtf |
...
DESCRIPTION OF TRANSCRIPTS.GTF FILE
...
| Code Block | ||
|---|---|---|
| ||
The secret lies in the gene_id column. #For counting novel entries grep 'CUFF' C1_R1_clout/transcripts.gtf |wc -l 5484754936 #For counting entries corresponding to annotated genes grep -v 'CUFF' C1_R1_clout/transcripts.gtf |wc -l 8864488724 What do you think grep -v does? |
Step 3:
...
Cuffmerge
HOW WAS IT RUN?
We first created a file listing the paths of all per-sample transcripts.gtf files so far, then pass that to cuffmerge:
| Code Block | ||
|---|---|---|
| ||
find . -name transcripts.gtf > assembly_list.txt
#If you have a local copy:
cat assembly_list.txt
|
| Code Block | ||
|---|---|---|
| ||
cat run_commands/commands.cuffcompare .cuffmerge cuffmerge -g reference/genes.exons.gtf assembly_list.txt |
HOW DOES THE OUTPUT LOOK?
...
The most important file
...
is merged.gif, which contains the consensus transcriptome annotations cuffmerge has calculated. This is the input for the next step.
| Code Block | |
|---|---|
|
...
| |
#If you have a local copy:
ls -l |
...
merged_ |
...
asm drwx- |
...
-- |
...
- |
...
-- |
...
2 daras G-801020 |
...
|
...
|
...
32768 May 28 05:36 logs -rw- |
...
- |
...
-- |
...
- |
...
- |
...
- |
...
1 daras G-801020 |
...
29449035 May |
...
28 |
...
05: |
...
36 |
...
DESCRIPTION OF *TRACKING FILE CLASS CODES
...
The gene_name attribute of the reference GTF record for this transcript,
or '-' if no reference transcript overlaps this Cufflinks transcript
...
The transcript_id attribute of the reference GTF record for this transcript,
or '-' if no reference transcript overlaps this Cufflinks transcript
...
The type of match between the Cufflinks transcripts in column 6 and
...
merged.gtf |
The cuffmerge result also has information about how this transcript compares to annotated ones in your gtf file.
CLASS CODES
| Priority | Code | Description | |
| 1 | = | Complete match of intron chain | |
| 2 | c | Contained | |
| 3 | j | Potentially novel isoform (fragment): at least one splice junction is shared with a reference transcript | |
| 4 | e | Single exon transfrag overlapping a reference exon and at least 10 bp of a reference intron, indicating a possible pre-mRNA fragment. | |
| 5 | i | A transfrag falling entirely within a reference intron | |
| 6 | o | Generic exonic overlap with a reference transcript | |
| 7 | p | Possible polymerase run-on fragment (within 2Kbases of a reference transcript) | |
| 8 | r | Repeat. Currently determined by looking at the soft-masked reference sequence and applied to transcripts where at least 50% of the bases are lower case | |
| 9 | u | Unknown, intergenic transcript | |
| 10 | x | Exonic overlap with reference on the opposite strand | |
| 11 | s | An intron of the transfrag overlaps a reference intron on the opposite strand (likely due to read mapping errors) |
Exercise
...
1: Count how many potentially novel fragments we have in our data
| Code Block | ||
|---|---|---|
|
...
grep |
...
'class_code "j" |
...
' merged_asm/merged.gtf |wc -l |
...
17441 |
Step 4:
...
Cuffquant
HOW WAS IT RUN?
We first created a file listing the paths of all per-sample transcripts.gtf files so far, then pass that to cuffmerge:
| Code Block | ||
|---|---|---|
| ||
find . -name transcripts.gtf > assembly_list.txt
#If you have a local copy:
cat assembly_list.txt
|
| Code Block | ||
|---|---|---|
| ||
cat run_commands/commands.cuffmerge
cuffmerge -g reference/genes.exons.gtf assembly_list.txt |
cuffquant
|
HOW DOES THE OUTPUT LOOK?The most important file is merged.gif, which contains the consensus transcriptome annotations cuffmerge has calculated. This is the input for the next step
It makes one file per sample- abuandances.cxb. It's a binary file containing the quantification results for each sample, that are now ready for cuffdiff.
| Code Block | |
|---|---|
|
...
| |
#If you have a local copy:
ls -l |
...
cuffquant_C1_R1_out/ -rw------- 1 daras G- |
...
801020 |
...
18373486 |
...
May |
...
28 |
...
05:24 abundances.cxb
|
Step 5: Cuffdiff
HOW WAS IT RUN?
| Code Block | ||
|---|---|---|
| ||
#If you have a local copy: cuffdiff -o diff_out -b reference/genome.fa -p 8 -L C1,C2 -u merged_asm/merged.gtf C1_R1_thout/accepted_hits.bam,C1_R2_thout/accepted_hits.bam,C1_R3_thout/accepted_hits.bam C2_R1_thout/accepted_hits.bam,C2_R2_thout/accepted_hits.bam,C2_R3_thout/accepted_hits.bamcat run_commands/commands.cuffdiff |
| Code Block | ||
|---|---|---|
| ||
#If you have a local copy: ls -l diff_out -rwxr-x--- 1 daras G-801020 2691192 Aug 21 12:20 isoform_exp.diff : Differential expression testing for transcripts -rwxr-x--- 1 daras G-801020 1483520 Aug 21 12:20 gene_exp.diff : Differential expression testing for genes -rwxr-x--- 1 daras G-801020 1729831 Aug 21 12:20 tss_group_exp.diff: Differential expression testing for primary transcripts -rwxr-x--- 1 daras G-801020 1369451 Aug 21 12:20 cds_exp.diff : Differential expression testing for coding sequences -rwxr-x--- 1 daras G-801020 3277177 Aug 21 12:20 isoforms.fpkm_tracking -rwxr-x--- 1 daras G-801020 1628659 Aug 21 12:20 genes.fpkm_tracking -rwxr-x--- 1 daras G-801020 1885773 Aug 21 12:20 tss_groups.fpkm_tracking -rwxr-x--- 1 daras G-801020 1477492 Aug 21 12:20 cds.fpkm_tracking -rwxr-x--- 1 daras G-801020 1349574 Aug 21 12:20 splicing.diff : Differential splicing tests -rwxr-x--- 1 daras G-801020 1158560 Aug 21 12:20 promoters.diff : Differential promoter usage -rwxr-x--- 1 daras G-801020 919690 Aug 21 12:20 cds.diff : Differential coding output |
PARSING CUFFDIFF OUTPUT
Exercise 3: What is the status column in our gene_exp.diff files? and what values does it have?
| Code Block | ||
|---|---|---|
| ||
cat diff_out/gene_exp.diff|cut -f 7|sort|uniq |
Here is a basic command useful for parsing/sorting the gene_exp.diff or isoform_exp.diff files:
| Code Block | ||
|---|---|---|
| ||
cat diff_out/isoform_exp.diff | awk '{print $10 "\t" $4}' | sort -n -r | head
|
Exercise 3: Find the 10 most up-regulated genes, by fold change that are classified as significantly changed.
| Code Block | ||
|---|---|---|
| ||
cat diff_out/gene_exp.diff |grep 'yes'|sort -k10nrn -r -k10,10|head #Lets pull out just gene names cat diff_out/gene_exp.diff |grep 'yes'|sort -n -r -k10nrk10,10|head|cut -f 3 |
| Expand | ||||
|---|---|---|---|---|
| ||||
Top 10 upregulated genes scf crc Nacalpha Df31 KdelR Nacalpha CG8979 CG4389 Vha68-2regucalcin CG3835 CG1746regucalcin |
Exercise 4: Find the 10 most down-regulated genes, by fold change that are classified as significantly changed.
| Code Block | ||
|---|---|---|
| ||
cat diff_out/gene_exp.diff |grep 'yes'|sort -k10n,10|head|cut -f 3 |
| Expand | ||||
|---|---|---|---|---|
| ||||
Top 10 downregulated genes CG17065HdacX GitNep2 CG18519RpS19bdel Amy-d CG13319CG6847 RNaseX25Aplip1 msb1lqua achiCG17065 CG15611CG31975 CG11309Git |
Exercise 5: Find the 10 most up-regulated isoforms, by fold change that are classified as significantly changed. What genes do they belong to?
| Code Block | ||
|---|---|---|
| ||
cat diff_out/isoform_exp.diff |grep 'yes'|sort -k10nr,10|head
#To pull out gene names:
cat diff_out/isoform_exp.diff |grep 'yes'|sort -k10nr,10|head |cut -f 3
|
| Expand | ||||
|---|---|---|---|---|
| ||||
sickCG30460 mubHmgs CG5021Lerp akirinkst CG4587 c(3)G Ank2 CG1674 CrebB-17A staiena crol Cnx99A CG42345 Nipsnap CG15814 |
Exercise 6: Pull out and count the significantly differentially expressed genes do we have (pvalue qvalue <=0.05 and abs fold change > 2 or log2 fold change > 1)?
| Code Block | ||
|---|---|---|
| ||
cat diff_out/gene_exp.diff |grep 'yes' |awk '{if (($10 >= 1)||($10<=1)) print }' > DEG
wc -l DEG |