Page Comparison

Class Slides

Idev session and getting data

...

Step 1: Evaluate and manipulate raw data

Code Block

title	Count and Fastqc

cd quality_exercise
 
head data/Sample1_R1.fastq 

wc -l data/Sample1_R1.fastq 
grep -c '^@HWI' data/Sample1_R1.fastq 

module spider fastqc
module load fastqc
fastqc -h
#DONT RUN ON HEAD NODE:
fastqc data/Sample1_R1.fastq

Look at Fastqc results:

Ideal dataset: http://www.bioinformatics.babraham.ac.uk/projects/fastqc/good_sequence_short_fastqc.html

Our dataset: http://web.corral.tacc.utexas.edu/BioITeam/rnaseq_course/fastqc_exercise/Sample1_R1_fastqc/fastqc_report.html

manipulate fastq files- adaptor trimming

Code Block

title	manipulate fastq files- quality trimming and filtering

module spider fastx
module load fastx_toolkit
#DONT RUN ON HEAD NODE:
fastx_trimmer -i data/Sample1_R1.fastq -l 90 -Q 33 -o Sample1_R1.trimmed.fastq
#DONT RUN ON HEAD NODE:
fastq_quality_filter -q 20 -p 80 -i data/Sample1_R1.fastq -Q 33 -o Sample1_R1.filtered.fastq


grep -c '^@HWI' results/Sample1_R1.trimmed.fastq 
grep -c '^@HWI' results/Sample1_R1.filtered.fastq

Code Block

title

fastx_clipper -h
#DONT RUN ON HEAD NODE:
/corral-repl/utexas/BioITeam/bin/cutadapt-1.3/bin/cutadapt -m 22 -a GATCGGAAGAGCACACGTCTGAACTCCAGTCAC Sample1_R1.fastq
/corral-repl/utexas/BioITeam/bin/cutadapt-1.3/bin/cutadapt -m 22 -a TGATCGTCGGACTGTAGAACTCTGAACGTGTAGA Sample1_R1.fastq

Step 2: Map Raw Data to Reference, Assess results

Code Block

title	Map with BWA

cd ..
cd mapping_exercise
module spider bwa
module load bwa/0.7.712


#DONT RUN ON HEAD NODE:
bwa index -a bwtsw reference/genometranscripts.fafasta

bwa mem

#DONT RUN ON HEAD NODE:
bwa mem ../reference/genometranscripts.fafasta ../data/GSM794483_C1_R1_1.fq data/GSM794483_C1_R1_2.fq > C1_R1.mem.sam

Code Block

title	Map with Tophat

module spider tophat
module load tophat/2.0.10
 hisat

#DONT RUN ON HEAD NODE
hisat2-build reference/genome.fa reference/genome.fa

 
#DONT RUN ON HEAD NODE:
tophathisat2 -p 2 -G x ../reference/genesgenome.gtffa -o C1_R1_thout reference/genome.fa 1 ../data/GSM794483_C1_R1_1.fq -2 ../data/GSM794483_C1_R1_2.fq -S GSM794483_C1.sam --phred33 --novel-splicesite-outfile GSM794483_C1.junctions --rna-strandness RF --dta -t

Code Block

title	Assess Mapping Results

module load samtools

#SYNTAX:
samtools view -b -S samfile > bamfile
samtools sort bamfile sortedbamfile
samtools index sortedbamfile


#BWA RESULTS
samtools flagstat bwa_mem_results_transcriptome/C1_R1.mem.bam
samtools idxstats bwa_mem_results_transcriptome/C1_R1.mem.bam


#SAMTOOLS RESULTS
samtools flagstat tophat_results/C1_R1_thout/accepted_hits.bam

#How is tophathisat2 result different from bwa
head -20 tophathisat_exercise/results/C1_R1_thout/accepted_hitsGSM794483_C1.sam
cutgrep -fv 6 tophat'^@' bwa_exercise/bwa_mem_results_transcriptome/C1_R1_thout/accepted_hits.mem.sam|grep 'N'|head
head|cut -f 6 bwa_results/C1_R1.mem.sam|grep 'N'|head
grep -v '^@' hisat_exercise/results/GSM794483_C1.sam|head|cut -f 6


#DONT RUN ON HEAD NODE
cutgrep -fv 6'^@' tophathisat_exercise/results/C1_R1_thout/accepted_hitsGSM794483_C1.sam|grep 'N'cut -f 6|wc -l

Step 3 and Step 4 : Quantify transcripts and ID DEGs

Code Block

title	Look at bedtools and DESeq results

cd deg_exercise/
 
module swap intel gcc/4.7.1
module load bedtools
bedtools
bedtools multicov
#DONT RUN ON HEAD NODE
bedtools multicov -bams hisat_results/C1_R1_thout/accepted_hits.sorted.bam hisat_results/C1_R2_thout/accepted_hits.sorted.bam hisat_results/C1_R3_thout/accepted_hits.sorted.bam hisat_results/C2_R1_thout/accepted_hits.sorted.bam hisat_results/C2_R2_thout/accepted_hits.sorted.bam hisat_results/C2_R3_thout/accepted_hits.sorted.bam -bed ../reference/genes.formatted.gtf > gene_counts.gff


#What's the difference between bedtools and htseq results?
 
#First look at the annotation file
head ../mapping_exercise/reference/genes.formatted.gtf 

#bedtools result 
head gene_counts.gff 
wc -l gene_counts.gff
 
#htseq result
head gene_counts.htseq.gff 
wc -l gene_counts.htseq.gff
 
#DESEQ2 result:
head deseq_results/DESeq-C1-vs-C2.csv

Step 2: Assemble Transcripts/Tuxedo Suite

CUFFLINKS OUTPUT: TRANSCRIPTS.GTF FILE

Each row corresponds to one transcript or an exon of a transcript. First 7 columns are standard gff/gtf columns, followed by some attributes added by cufflinks.

Column number	Column name	Example	Description
1	seqname	chrX	Chromosome or contig name
2	source	Cufflinks	The name of the program that generated this file (always 'Cufflinks')
3	feature	exon	The type of record (always either "transcript" or "exon".)
4	start	77696957	The leftmost coordinate of this record.
5	end	77712009	The rightmost coordinate of this record, inclusive.
6	score	77712009	Score that tells you how abundant this isoform is compared to other isoforms of the gene
7	strand	+	Cufflinks' guess for which strand the isoform came from. Always one of "+", "-", "."
7	frame	.	Not used
8	attributes	...	See below.

Each GTF record is decorated with the following attributes:

Attribute	Example	Description
gene_id	CUFF.1	Cufflinks gene id
transcript_id	CUFF.1.1	Cufflinks transcript id
FPKM	101.267	Isoform-level relative abundance in FPKM
frac	0.7647	Reserved. Please ignore.

cov	100.765	Estimate for the absolute depth of read coverage across the whole transcript
full_read_support	yes	When RABT assembly is used, this attribute reports whether or not all introns and internal exons were fully covered by reads from the data.

Code Block

title	Look at cufflinks results

cd cufflinks_cuffdiff_exercise
ls -l C1_R1_clout
head C1_R1_clout/transcripts.gtf

Find out how many entries in the transcripts.gtf file are from novel transcripts and how many from annotated transcripts

Code Block

title	Parsing transcripts.gtf file

The secret lies in
#Find the gene_id column.

#For counting novel entries 
grep 'CUFF' C1_R1_clout/transcripts.gtf |wc -l

54847
 
#For counting entries corresponding to annotated genes
grep -v 'CUFF' C1_R1_clout/transcripts.gtf |wc -l

88644
 
What do you think grep -v does?

CUFFMERGE OUTPUT

Code Block

title	Look at CUFFMERGE results

head merged_asm/merged.gtf

CLASS CODES

Priority	Code	Description
1	=	Complete match of intron chain
2	c	Contained
3	j	Potentially novel isoform (fragment): at least one splice junction is shared with a reference transcript
4	e	Single exon transfrag overlapping a reference exon and at least 10 bp of a reference intron, indicating a possible pre-mRNA fragment.
5	i	A transfrag falling entirely within a reference intron
6	o	Generic exonic overlap with a reference transcript
7	p	Possible polymerase run-on fragment (within 2Kbases of a reference transcript)
8	r	Repeat. Currently determined by looking at the soft-masked reference sequence and applied to transcripts where at least 50% of the bases are lower case
9	u	Unknown, intergenic transcript
10	x	Exonic overlap with reference on the opposite strand
11	s	An intron of the transfrag overlaps a reference intron on the opposite strand (likely due to read mapping errors)

Code Block

title	Look at CUFFMERGE results

#count number of transcripts that are potentially novel
grep 'class_code "j"' merged_asm/merged.gtf|cut -f 9|cut -d ';' -f 5|sort|uniq|wc -l

CUFFDIFF OUTPUT

Code Block

title	Look at cuffdiff results

ls -l diff_out
head diff_out/gene_exp.diff

Here is a basic command useful for parsing/sorting the gene_exp.diff or isoform_exp.diff files:

Code Block

title	Linux one-liner for sorting cuffdiff output by log2 fold-change values

cat diff_out/isoform_exp.diff | awk '{print $10 "\t" $4}' | sort -n -r | head

Find the 10 most up-regulated genes, by fold change that are classified as significantly changed.

Code Block

title	One-line command to get 10 most up regulated genes

cat diff_out/gene_exp.diff |grep 'yes'|sort -k10nr,10|head

 
#Lets pull out just gene names
cat diff_out/gene_exp.diff |grep 'yes'|sort -k10nr,10|head|cut -f 3

ExpandSolutionSolution

Top 10 upregulated genes

scf

crc

KdelR

Nacalpha

CG8979

CG4389

Vha68-2

regucalcin

CG3835

CG1746

Find the 10 most down-regulated genes, by fold change that are classified as significantly changed.

Code Block

title	One-line command to get 10 most down regulated genes

cat diff_out/gene_exp.diff |grep 'yes'|sort -k10n,10|head|cut -f 3

ExpandSolutionSolution

Top 10 downregulated genes

CG17065

Git

CG18519

del

CG13319

RNaseX25

msb1l

achi

CG15611

CG11309

Find the 10 most up-regulated isoforms, by fold change that are classified as significantly changed. What genes do they belong to?

Code Block

title	One-line command to get 10 most up-regulated isoforms

cat diff_out/isoform_exp.diff |grep 'yes'|sort -k10nr,10|head

 
#To pull out gene names:
cat diff_out/isoform_exp.diff |grep 'yes'|sort -k10nr,10|head |cut -f 3

ExpandSolutionSolution

sick

mub

CG5021

akirin

CG4587

c(3)G

Ank2

CG1674

CrebB-17A

stai

Pull out and count the significantly differentially expressed genes do we have (pvalue <=0.05 and abs fold change > 2 or log2 fold change > 1)?

Code Block

title	DEG list

cat diff_out/gene_exp.diff |grep 'yes' |awk '{if (($10 >= 1)||($10<=1)) print }' > DEG

wc -l DEGtop 10 upregulated genes
sed 's/,/\t/g' deseq_results/DESeq-C1-vs-C2.csv|sort -n -r -k3,3|cut -f 1,3|head
#If there's an idiosyncracy with sort 
sed 's/,/\t/g' deseq_results/DESeq-C1-vs-C2.csvsort -n -r -k3,3|grep -v 'e-0'|cut -f 1,3|head
 
#Find the top 10 downregulated genes
sed 's/,/\t/g' deseq_results/DESeq-C1-vs-C2.csv|sort -n -k3,3|cut -f 1,3|head
#If there's an idiosyncracy with sort 
sed 's/,/\t/g' deseq_results/DESeq-C1-vs-C2.csv|sort -n -k3,3|grep -v 'e-0'|cut -f 1,3|head
 
#Count the DEGs
sed 's/,/\t/g' deseq_results/DESeq-C1-vs-C2.csv|awk '{if ((($3>=1)||($3<=-1))&&($6<=0.05)) print $1,$3,$6}'|wc -l

Versions Compared

Old Version 23

New Version 24

Key

Class Slides

Idev session and getting data

Step 1: Evaluate and manipulate raw data