Class Slides
Idev session and getting data
...
Step 1: Evaluate and manipulate raw data
Code Block | ||
---|---|---|
| ||
cd quality_exercise head data/Sample1_R1.fastq wc -l data/Sample1_R1.fastq grep -c '^@HWI' data/Sample1_R1.fastq module spider fastqc module load fastqc fastqc -h #DONT RUN ON HEAD NODE: fastqc data/Sample1_R1.fastq |
Look at Fastqc results:
Ideal dataset: http://www.bioinformatics.babraham.ac.uk/projects/fastqc/good_sequence_short_fastqc.html
Our dataset: http://web.corral.tacc.utexas.edu/BioITeam/rnaseq_course/fastqc_exercise/Sample1_R1_fastqc/fastqc_report.html
Code Block | ||
---|---|---|
| ||
module spider fastx module load fastx_toolkit #DONT RUN ON HEAD NODE: fastx_trimmer -i data/Sample1_R1.fastq -l 90 -Q 33 -o Sample1_R1.trimmed.fastq #DONT RUN ON HEAD NODE: fastq_quality_filter -q 20 -p 80 -i data/Sample1_R1.fastq -Q 33 -o Sample1_R1.filtered.fastq grep -c '^@HWI' results/Sample1_R1.trimmed.fastq grep -c '^@HWI' results/Sample1_R1.filtered.fastq | ||
Code Block | ||
title | manipulate fastq files- adaptor trimming||
fastx_clipper -h
#DONT RUN ON HEAD NODE:
/corral-repl/utexas/BioITeam/bin/cutadapt-1.3/bin/cutadapt -m 22 -a GATCGGAAGAGCACACGTCTGAACTCCAGTCAC Sample1_R1.fastq
/corral-repl/utexas/BioITeam/bin/cutadapt-1.3/bin/cutadapt -m 22 -a TGATCGTCGGACTGTAGAACTCTGAACGTGTAGA Sample1_R1.fastq
|
Step 2: Map Raw Data to Reference, Assess results
Code Block | ||
---|---|---|
| ||
cd .. cd mapping_exercise module spider bwa module load bwa/0.7.712 #DONT RUN ON HEAD NODE: bwa index -a bwtsw reference/genometranscripts.fafasta bwa mem #DONT RUN ON HEAD NODE: bwa mem ../reference/genometranscripts.fafasta ../data/GSM794483_C1_R1_1.fq data/GSM794483_C1_R1_2.fq > C1_R1.mem.sam |
Code Block | ||
---|---|---|
| ||
module spider tophat module load tophat/2.0.10 hisat #DONT RUN ON HEAD NODE hisat2-build reference/genome.fa reference/genome.fa #DONT RUN ON HEAD NODE: tophathisat2 -p 2 -G x ../reference/genesgenome.gtffa -o C1_R1_thout reference/genome.fa 1 ../data/GSM794483_C1_R1_1.fq -2 ../data/GSM794483_C1_R1_2.fq -S GSM794483_C1.sam --phred33 --novel-splicesite-outfile GSM794483_C1.junctions --rna-strandness RF --dta -t |
Code Block | ||
---|---|---|
| ||
module load samtools #SYNTAX: samtools view -b -S samfile > bamfile samtools sort bamfile sortedbamfile samtools index sortedbamfile #BWA RESULTS samtools flagstat bwa_mem_results_transcriptome/C1_R1.mem.bam samtools idxstats bwa_mem_results_transcriptome/C1_R1.mem.bam #SAMTOOLS RESULTS samtools flagstat tophat_results/C1_R1_thout/accepted_hits.bam #How is tophathisat2 result different from bwa head -20 tophathisat_exercise/results/C1_R1_thout/accepted_hitsGSM794483_C1.sam cutgrep -fv 6 tophat'^@' bwa_exercise/bwa_mem_results_transcriptome/C1_R1_thout/accepted_hits.mem.sam|grep 'N'|head head|cut -f 6 bwa_results/C1_R1.mem.sam|grep 'N'|head grep -v '^@' hisat_exercise/results/GSM794483_C1.sam|head|cut -f 6 #DONT RUN ON HEAD NODE cutgrep -fv 6'^@' tophathisat_exercise/results/C1_R1_thout/accepted_hitsGSM794483_C1.sam|grep 'N'cut -f 6|wc -l |
Code Block | ||
---|---|---|
| ||
cd deg_exercise/ module swap intel gcc/4.7.1 module load bedtools bedtools bedtools multicov #DONT RUN ON HEAD NODE bedtools multicov -bams hisat_results/C1_R1_thout/accepted_hits.sorted.bam hisat_results/C1_R2_thout/accepted_hits.sorted.bam hisat_results/C1_R3_thout/accepted_hits.sorted.bam hisat_results/C2_R1_thout/accepted_hits.sorted.bam hisat_results/C2_R2_thout/accepted_hits.sorted.bam hisat_results/C2_R3_thout/accepted_hits.sorted.bam -bed ../reference/genes.formatted.gtf > gene_counts.gff #What's the difference between bedtools and htseq results? #First look at the annotation file head ../mapping_exercise/reference/genes.formatted.gtf #bedtools result head gene_counts.gff wc -l gene_counts.gff #htseq result head gene_counts.htseq.gff wc -l gene_counts.htseq.gff #DESEQ2 result: head deseq_results/DESeq-C1-vs-C2.csv |
Step 2: Assemble Transcripts/Tuxedo Suite
CUFFLINKS OUTPUT: TRANSCRIPTS.GTF FILE
Each row corresponds to one transcript or an exon of a transcript. First 7 columns are standard gff/gtf columns, followed by some attributes added by cufflinks.
Column number | Column name | Example | Description |
1 | seqname | chrX | Chromosome or contig name |
2 | source | Cufflinks | The name of the program that generated this file (always 'Cufflinks') |
3 | feature | exon | The type of record (always either "transcript" or "exon".) |
4 | start | 77696957 | The leftmost coordinate of this record. |
5 | end | 77712009 | The rightmost coordinate of this record, inclusive. |
6 | score | 77712009 | Score that tells you how abundant this isoform is compared to other isoforms of the gene |
7 | strand | + | Cufflinks' guess for which strand the isoform came from. Always one of "+", "-", "." |
7 | frame | . | Not used |
8 | attributes | ... | See below. |
- Each GTF record is decorated with the following attributes:
Attribute | Example | Description |
gene_id | CUFF.1 | Cufflinks gene id |
transcript_id | CUFF.1.1 | Cufflinks transcript id |
FPKM | 101.267 | Isoform-level relative abundance in FPKM |
frac | 0.7647 | Reserved. Please ignore. |
cov | 100.765 | Estimate for the absolute depth of read coverage across the whole transcript |
full_read_support | yes | When RABT assembly is used, this attribute reports whether or not all introns and internal exons were fully covered by reads from the data. |
Code Block | ||
---|---|---|
| ||
cd cufflinks_cuffdiff_exercise ls -l C1_R1_clout head C1_R1_clout/transcripts.gtf |
Find out how many entries in the transcripts.gtf file are from novel transcripts and how many from annotated transcripts
Code Block | ||
---|---|---|
| ||
The secret lies in #Find the gene_id column. #For counting novel entries grep 'CUFF' C1_R1_clout/transcripts.gtf |wc -l 54847 #For counting entries corresponding to annotated genes grep -v 'CUFF' C1_R1_clout/transcripts.gtf |wc -l 88644 What do you think grep -v does? |
CUFFMERGE OUTPUT
Code Block | ||
---|---|---|
| ||
head merged_asm/merged.gtf
|
CLASS CODES
Priority | Code | Description | |
1 | = | Complete match of intron chain | |
2 | c | Contained | |
3 | j | Potentially novel isoform (fragment): at least one splice junction is shared with a reference transcript | |
4 | e | Single exon transfrag overlapping a reference exon and at least 10 bp of a reference intron, indicating a possible pre-mRNA fragment. | |
5 | i | A transfrag falling entirely within a reference intron | |
6 | o | Generic exonic overlap with a reference transcript | |
7 | p | Possible polymerase run-on fragment (within 2Kbases of a reference transcript) | |
8 | r | Repeat. Currently determined by looking at the soft-masked reference sequence and applied to transcripts where at least 50% of the bases are lower case | |
9 | u | Unknown, intergenic transcript | |
10 | x | Exonic overlap with reference on the opposite strand | |
11 | s | An intron of the transfrag overlaps a reference intron on the opposite strand (likely due to read mapping errors) |
Code Block | ||
---|---|---|
| ||
#count number of transcripts that are potentially novel
grep 'class_code "j"' merged_asm/merged.gtf|cut -f 9|cut -d ';' -f 5|sort|uniq|wc -l |
CUFFDIFF OUTPUT
Code Block | ||
---|---|---|
| ||
ls -l diff_out
head diff_out/gene_exp.diff
|
Here is a basic command useful for parsing/sorting the gene_exp.diff
or isoform_exp.diff
files:
Code Block | ||
---|---|---|
| ||
cat diff_out/isoform_exp.diff | awk '{print $10 "\t" $4}' | sort -n -r | head
|
Find the 10 most up-regulated genes, by fold change that are classified as significantly changed.
Code Block | ||
---|---|---|
| ||
cat diff_out/gene_exp.diff |grep 'yes'|sort -k10nr,10|head
#Lets pull out just gene names
cat diff_out/gene_exp.diff |grep 'yes'|sort -k10nr,10|head|cut -f 3 |
Top 10 upregulated genes
scf
crc
KdelR
Nacalpha
CG8979
CG4389
Vha68-2
regucalcin
CG3835
CG1746
Find the 10 most down-regulated genes, by fold change that are classified as significantly changed.
Code Block | ||
---|---|---|
| ||
cat diff_out/gene_exp.diff |grep 'yes'|sort -k10n,10|head|cut -f 3 |
Top 10 downregulated genes
CG17065
Git
CG18519
del
CG13319
RNaseX25
msb1l
achi
CG15611
CG11309
Find the 10 most up-regulated isoforms, by fold change that are classified as significantly changed. What genes do they belong to?
Code Block | ||
---|---|---|
| ||
cat diff_out/isoform_exp.diff |grep 'yes'|sort -k10nr,10|head
#To pull out gene names:
cat diff_out/isoform_exp.diff |grep 'yes'|sort -k10nr,10|head |cut -f 3 |
sick
mub
CG5021
akirin
CG4587
c(3)G
Ank2
CG1674
CrebB-17A
stai
Pull out and count the significantly differentially expressed genes do we have (pvalue <=0.05 and abs fold change > 2 or log2 fold change > 1)?
Code Block | ||
---|---|---|
| ||
cat diff_out/gene_exp.diff |grep 'yes' |awk '{if (($10 >= 1)||($10<=1)) print }' > DEG wc -l DEGtop 10 upregulated genes sed 's/,/\t/g' deseq_results/DESeq-C1-vs-C2.csv|sort -n -r -k3,3|cut -f 1,3|head #If there's an idiosyncracy with sort sed 's/,/\t/g' deseq_results/DESeq-C1-vs-C2.csvsort -n -r -k3,3|grep -v 'e-0'|cut -f 1,3|head #Find the top 10 downregulated genes sed 's/,/\t/g' deseq_results/DESeq-C1-vs-C2.csv|sort -n -k3,3|cut -f 1,3|head #If there's an idiosyncracy with sort sed 's/,/\t/g' deseq_results/DESeq-C1-vs-C2.csv|sort -n -k3,3|grep -v 'e-0'|cut -f 1,3|head #Count the DEGs sed 's/,/\t/g' deseq_results/DESeq-C1-vs-C2.csv|awk '{if ((($3>=1)||($3<=-1))&&($6<=0.05)) print $1,$3,$6}'|wc -l |