Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

Code Block
bedtools multicov -bams SRR034450.bam SRR034451.bam SRR034452.bam SRR034453.bam -bed NC_017544.1.genes.gff > gene_counts.gff

...

head gene_counts.gff

Analyze differential gene expression (DESeq)

Our data that is cluttered with a lot of extra columns and one column stuffed with tag\=value information (including the gene names that we want!). Let's clean it up a bit before loading into R - which likes to work on simple tables. GFF are tab-delimited files.

We can do this cleanup many ways, but a quick one is to use the Unix string editor sed. This command replaces the entire beginning of the line up to locus_tag= with nothing (that is, it deletes it). This conveniently leaves us with just the locus_tag and the columns of read counts in each gene. If you were writing a real pipeline, you would probably want to use a Perl script that would check to be sure that each line had the locus_tag (they do)

...

, among other things.

Code Block
titleReformatting gene_counts.gff

head gene_counts.gff
sed -e 's/^.*locus_tag=//' < gene_counts.gff > gene_counts.tab
head gene_counts.tab

Now, we are going to load the GFF file straight into R, remove the columns we don't want, name the columns and rows like R expects, and write out this file. You could do this in any other scripting language, or even Excel. We will write out the first few lines of the file at each step, so that you can see what the command is doing.

vignette("DEseq")

Code Block
titleUsing DESeq
login1$ R
...
> library("DESeq")
> combinedcounts = read.csvtable("combinedgene_counts.csvtab", header=TF, row.names=1, sep="\t")
> head(counts)
> colnames(counts) = c("wt1", "mut1", "wt2", "mut2")
> head(counts)
> my.design <- data.frame(
  row.names = colnames( combinedcounts ),
  condition = c( "Ancwt", "Ancmut", "ELwt", "EL", "EW", "EW"mut"),
  libType = c( "single-end", "single-end", "single-end",
  "single-end", "single-end", "single-end" ) 
)
)
> design
> conds <- factor(my.design$condition)
> cds <- newCountDataSet( combinedcounts, conds )
> summary(cds)
> cds <- estimateSizeFactors( cds )
> sizeFactors( cds )
> cds <- estimateDispersions( cds )

> pdf("dispersion_estimates.pdf")
> plotDispEsts( cds )
> res dev.off()

> result <- nbinomTest( cds, "ELwt", "EW" )
mut" )
> head(result)
> result = result[order(result$pval), ]

> write.csv(resresult, "ELwt-vs-EWmut.csv")

> pdf("MA-plot")
> plot(
  result$baseMean,
  result$log2FoldChange,
  log="x", pch=20, cex=.3,
  col = ifelse( result$padj < .1, "red", "black" ) )
> dev.off()

> q()
Save workspace image? [y/n/c]: n
login1$ head wt-vs-mut.csv

Additional Points

  • In an actual RNAseq analysis, you might want to trim stray adaptor sequences from your data using a tool like the FASTX-Toolkit or FAR before aligning.