Content Comparison

...

Warning
The install commands may take several minutes to complete. You can read ahead while they run or even open a new terminal window and connect it to Lonestar and continue onward in the tutorial as you wait for R.

Code Block

title	Starting R and loading the modules for this tutorial

login1$ module load R
login1$ R

R version 2.1415.03 (20112013-1003-31)
01) -- "Security Blanket"
Copyright (C) 20112013 The R Foundation for Statistical Computing
ISBN 3-900051-07-0
Platform: x86_64-unknown-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

  Natural language support but running in an English locale

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

> source("http://bioconductor.org/biocLite.R")
...Warning >in biocLiteinstall.packages("DESeqBiocInstaller")
...
> biocLite("edgeR")
, repos = a["BioCsoft", "URL"]) :
  'lib = "/opt/apps/R/2.15.3/lib64/R/library"' is not writable
Would you like to use a personal library instead?  (y/n) y
Would you like to create a personal library
~/R/x86_64-unknown-linux-gnu-library/2.15
to install packages into?  (y/n) y
...
> biocLite("DESeq")
...
> biocLite("edgeR")
...
> q()
Save workspace image? [y/n/c]: n

...

We've done this several times before, so you should be able to come up with the full command lines if you refer back to the original lesson.

Warning
Be careful we are now mapping single-end reads, so you may have to look at the bowtie help to figure out how to do that!

...

Expand

	Please take me through all of the steps...
	Please take me through all of the steps...

Code Block
module load bowtie bowtie-build NC_017544.1.fasta NC_017544.1

Now create a bowtie_commands file that looks like this using nano or another text editor:

Code Block

bowtie -p 3 -S NC_017544.1 SRR034450.fastq -S SRR034450.sam
bowtie -p 3 -S NC_017544.1 SRR034451.fastq -S SRR034451.sam
bowtie -p 3 -S NC_017544.1 SRR034452.fastq SRR034452.sam
bowtie -p 3 -S SRR034452.sam
bowtie -p 3 -S NC_017544.1 SRR034453.fastq -S SRR034453.sam
 NC_017544.1 SRR034453.fastq SRR034453.sam

Remember that there are 12 processors per node on Lonestar, so we choose to use 3 for each of the 4 jobs with the -p 3 option.

Create the launcher script and run it:

Warning
Remember that you cannot qsub from within an idev shell!

Code Block
module load python launcher_creator.py -n bowtie -q development -cj bowtie_commands -t 0:30:00 qsub bowtie.sge

...

Code Block
head gene_counts.gff
Warning

Optional: HTseq

HTseq is another tool to count reads. bedtools has many many useful functions, and counting reads is just one of them. In contrast, HTseq is a specialized utility for counting reads, and it does not have many functions other than that. HTseq is very slow and you need to run multiple command lines in order to do the same job as what bedtools multicov did. Why do we learn this? Well, you may want to care about reads mapped on intersection when you count reads. Please take a look at this page, and if this sophisticated counting method looks useful for you, use HTseq. Otherwise, use bedtools.

Code Block


grep "^NC_017544" NC_017544.1.gff > count_ref.gff
samtools view SRR034450.bam | htseq-count -m intersection-nonempty -t gene -i ID - count_ref.gff > count1.gff
samtools view SRR034451.bam | htseq-count -m intersection-nonempty -t gene -i ID - count_ref.gff > count2.gff
samtools view SRR034452.bam | htseq-count -m intersection-nonempty -t gene -i ID - count_ref.gff > count3.gff
samtools view SRR034453.bam | htseq-count -m intersection-nonempty -t gene -i ID - count_ref.gff > count4.gff

join count1.gff count2.gff | join - count3.gff | join - count4.gff > gene_counts_HTseq.gff
#if you have many samples, use for-loop and join

gene_counts_HTseq.gff has 5 more lines than gene_counts.gff. Check out the last 5 lines. They are basic statistics.

Code Block
wc -l gene_counts_HTseq.gff tail gene_counts_HTseq.gff

The basic statistics (last 5 lines) is useful to know, but should be removed to use it as a input file for DEGseq

Code Block
head -2910 gene_counts_HTseq.gff > gene_counts_HTseq.tab

Finally, gene_counts_HTseq.tab is ready to use. HTseq-count is strand-specific in default. Therefore, read counts for each gene in gene_counts_HTseq.gff are approximately a half counts in gene_counts.gff for the corresponding gene.

Analyze differential gene expression

DESeq

DESeq Manual and Instructions

Our data that is cluttered with a lot of extra columns and one column stuffed with tag=value information (including the gene names that we want!). Let's clean it up a bit before loading into R - which likes to work on simple tables. GFF are tab-delimited files.

We can do this cleanup many ways, but a quick one is to use the Unix string editor sed. This command replaces the entire beginning of the line up to locus_tag= with nothing (that is, it deletes it). This conveniently leaves us with just the locus_tag and the columns of read counts in each gene. If you were writing a real pipeline, you would probably want to use a Perl or Python script that would check to be sure that each line had the locus_tag (they do), among other things.

Code Block

title	Reformatting gene_counts.gff


head gene_counts.gff
sed 's/^.*locus_tag=//' gene_counts.gff > gene_counts.tab

After it has run, take a peek at the new file:

Code Block
head gene_counts.tab

Warning

title	Be very careful how you copy and paste from the example below.

Do not copy the > characters. Some commands are spread across multiple lines. The > are missing at the beginning of the lines after the first one in these cases. So this:

Code Block
> y <- c( 1:10 ) > y

Is the same as:

Code Block
> y <- c(1:10) > y

It's ok to copy across the multiple lines and paste into R as long as you get all the way to the closing parenthesis.

...

Code Block

title	Using edgeR

login1$ R
...
> library("edgeR")
> counts = read.delim("gene_counts.tab", header=F, row.names=1)
> colnames(counts) = c("wt1", "mut1", "wt2", "mut2")
> head(counts)
> group <- factor(c("wt","mut","wt","mut"))
> dge = DGEList(counts=counts,group=group)
> dge <- estimateCommonDisp(dge)
> dge <- estimateTagwiseDisp(dge)
> et <- exactTest(dge)
> etp <- topTags(et, n=100000)
> etp$table$logFC = -etp$table$logFC

> pdf("edgeR-MA-plot.pdf")
> plot(
  etp$table$logCPM,
  etp$table$logFC,
  xlim=c(-3, 20), ylim=c(-12, 12), pch=20, cex=.3,
  col = ifelse( etp$table$FDR < .1, "red", "black" ) )
> dev.off()

> write.csv(etp$table, "edgeR-wt-vs-mut.csv")
> dev.off()

> write.csv(etp$table, "q()
Save workspace image? [y/n/c]: n
login1$ head edgeR-wt-vs-mut.csv")

Note that the "FC" fold change calculated is initially the reverse of that for the DESeq example for the output here. It is wt relative to mut. To fix this, we put a negative in there for the log fold change.

...

Comparison

Compare the expression changes predicted by DESeq and edgeR to each other.
Does edgeR or DESeq predict more significant changes?

...

In an actual RNAseq analysis, you might want to trim stray adaptor sequences from your data using a tool like the FASTX-Toolkit, FAR, or cutadapt before aligningthe tools discussed in Evaluating your raw sequencing data.
You can get a lot more information from RNAseq data than you could from a microarray experiment. You can map transcriptional start sites, areas of unexpected transcription, splice sites, etc. - all because you have full sequence information that we have barely used in this example.
You can call variants from mapped RNAseq data, just be aware that many regions will have no coverage (because they are not expressed as RNA).

From here...

Visualize mapped reads in BAM files using IGV to manually check some of the gene counts.
Look at the more sophisticated "Tuxedo" suite of RNAseq tools, which performs many functions that are especially useful in Eukaryotic genomes.

Version	Old Version 65	New Version Current
Changes made by	Jeffrey E Barrick	Jeffrey E Barrick
Saved on	May 22, 2013	May 22, 2013

Versions Compared

Key

Optional: HTseq

Analyze differential gene expression

DESeq

Comparison

From here...