Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

Do not copy the > characters in the below example. They indicate are the R prompt to remind you which commands you are running the inside of R!

The commands for this example are also described in the DESeq vignette (PDF) .

Code Block
titleUsing DESeq
login1$ R
...
> library("DESeq")
> counts = read.tabledelim("gene_counts.tab", header=F, row.names=1, sep="\t")
> head(counts)
> colnames(counts) = c("wt1", "mut1", "wt2", "mut2")
> head(counts)
> my.design <- data.frame(
  row.names = colnames( counts ),
  condition = c( "wt", "mut", "wt", "mut"),
  libType = c( "single-end", "single-end", "single-end", "single-end" ) 
)
> conds <- factor(my.design$condition)

> cds <- newCountDataSet( counts, conds )
> cds

> cds <- estimateSizeFactors( cds )
> sizeFactors( cds )

> cds <- estimateDispersions( cds )

> pdf("DESeq-dispersion_estimates.pdf")
> plot(
  rowMeans( counts( cds, normalized=TRUE ) ),
  fitInfo(cds)$perGeneDispEsts,
  pch = '.', log="xy" )
  xg <- 10^seq( -.5, 5, length.out=300 )
  lines( xg, fitInfo(cds)$dispFun( xg ), col="red" )
)
> dev.off()

> result <- nbinomTest( cds, "wt", "mut" )
> head(result)

> result = result[order(result$pval), ]
> head(result)

> write.csv(result, "DESeq-wt-vs-mut.csv")

> pdf("DESeq-MA-plot.pdf")
> plot(
  result$baseMean,
  result$log2FoldChange,
  log="x", pch=20, cex=.3,
  col = ifelse( result$padj < .1, "red", "black" ) )
> dev.off()

> q()
Save workspace image? [y/n/c]: n
login1$ head wt-vs-mut.csv

...

  • What are the numbers returned by sizeFactors( cds )?
    Expand
    Answer...
    Answer...

    They are roughly speaking the relative average coverage of each data set? There are roughly 5 times as many counts of reads in genes for wt2 as there are for mut2. Specifically, they are the size parameter of the negative binomial fit to the counts per gene per data file.

  • What are the dispersion estimates?
    Expand
    Answer...
    Answer...

    The model assumes there is also a per-gene aspect to the variance in counts observed, that is again fit to a negative binomial distribution (=overdispersed Poisson distribution). The program fits a model where the lower the counts are the more dispersion is expected (red line in graph), and thus the less significant a change in counts becomes.

  • What was the predominant effect of the mutation on gene expression in this Listeria strain?

Analyze differential gene expression (edgeR)

edgeR is another R package that you can use to do a similar analysis.

These commands use the negative binomial model, calculate the false discovery rate (FDR ~ adjusted p-value), and make a plot similar to the one from DEseq.

Code Block
titleUsing edgeR

login1$ R
...
> library("edgeR")
> counts = read.delim("gene_counts.tab", header=F, row.names=1)
> colnames(counts) = c("wt1", "mut1", "wt2", "mut2")
> head(counts)
> group <- factor(c("wt","mut","wt","mut"))
> dge = DGEList(counts=counts,group=group)
> dge <- estimateCommonDisp(dge)
> dge <- estimateTagwiseDisp(dge)
> et <- exactTest(dge)
> etp <- topTags(et, n=100000)

> pdf("edgeR-MA-plot.pdf")
> plot(
  etp$table$logCPM,
  -etp$table$logFC,
  xlim=c(-3, 20), ylim=c(-12, 12), pch=20, cex=.3,
  col = ifelse( etp$table$FDR < .1, "red", "black" ) )
> dev.off()

> write.csv(etp$table, "DESeq-wt-vs-mut.csv")

Note that the "FC" fold change is reversed for the output here. It is wt relative to mut. (WHich is why we put a negative in the graphing function.)

Exercises

  • Compare the expression changes predicted by DESeq and edgeR to each other.
  • Does edgeR or DESeq predict more significant changes?

Additional Points

  • In an actual RNAseq analysis, you might want to trim stray adaptor sequences from your data using a tool like the FASTX-Toolkit or FAR before aligning.
  • You can get a lot more information from RNAseq data than you could from a microarray experiment. You can map transcriptional start sites, areas of unexpected transcription, splice sites, etc. - all because you have full sequence information that we have barely used in this example.