Please add this line to your ~/.profile_user file:
. /corral-repl/utexas/BioITeam/bin/profile_ngs_course.bash |
This will load a configuration file that we've created that sets up your environment in some nice ways...
In this exercise, we will analyze RNA-seq data to measure changes in gene expression levels between wild-type and mutant strains of the bacterium Listeria monocytogenes.
Copy the data files for this example into your $SCRATCH space:
cds cp -r /corral-repl/utexas/BioITeam/ngs_course/listeria_RNA_seq/data listeria_RNA_seq |
File Name |
Description |
Sample |
|---|---|---|
|
Single-end Illumina 36-bp reads |
wild-type, biological replicate 1 |
|
Single-end Illumina 36-bp reads |
ΔsigB mutant, biological replicate 1 |
|
Single-end Illumina 36-bp reads |
wild-type, biological replicate 2 |
|
Single-end Illumina 36-bp reads |
ΔsigB mutant, biological replicate 2 |
|
Reference Genome sequence (FASTA) |
Listeria monocytogenes strain 10403S |
|
Reference Genome features (GFF) |
Listeria monocytogenes strain 10403S |
This data was submitted in the Short Read Archive to accompany this paper:
You can view the data in the ENA SRA here: http://www.ebi.ac.uk/ena/data/view/SRP001753
Many of the modules for doing statistical tests on NGS data have been written in the "R" language for statistical computing. If you're not familiar with R, then this section is likely to be a bit confusing. (You might be thinking "Stop with the new languages already guys! Uncle!") To orient you, we are going to run the R command, which launches the R shell inside our terminal. Like the bash shell that we were using, the R shell interprets commands, but now they are R commands rather than bash commands. The prompt changes from login1$ to > when you are in the R shell, to help clue you in to this fact.
R is the favorite language of pirates.
R is a very common scripting language used in statistics. There are whole courses on using R going on in other SSI classrooms as we speak! Inside the R universe, you have access to an incredibly large number of useful statistical functions (Fisher's exact test, nonlinear least-squares fitting, ANOVA ...). R also has advanced functionality for producing plots and graphs as output. We'll take advantage of all of this here. You are well on your way to becoming denizens of the polyglot bioinformatics community now.
Regrettably, R is a bit of it's own bizarro world, as far as how its commands work. (Futhermore, Googling "R" to get help can be very frustrating.) The conventions of most other programming and scripting languages seem to have been re-invented by someone who wanted to do everything their own way in R. Just like we wrote shell scripts in bash, you can write R scripts that carry out complicated analyses.
Basic rules of R:
q() to quit.?command. Try ?read.table. The q key gets you out of help, just like for a man page.\<- (less-than-dash) is the same as an equals sign \=. You can use them interchangeably.x <- 10 + 5 + 6 x |
Like other languages, R can be expanded by loading modules. The R equivalent of Bioperl or Biopython is Bioconductor. Bioconductor can do things for you like convert sequences, but where it really shines is in doing statistical tests (where is it second-to-none in this list of languages). Many functions for analyzing microarray data are implemented in R, and this strength has now carried over to the analysis of RNAseq data.
Here's how you install two modules that we will need for this exercise:
login1$ module load R
login1$ R
R version 2.14.0 (2011-10-31)
Copyright (C) 2011 The R Foundation for Statistical Computing
ISBN 3-900051-07-0
Platform: x86_64-unknown-linux-gnu (64-bit)
R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.
Natural language support but running in an English locale
R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.
Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.
> source("http://bioconductor.org/biocLite.R")
...
> biocLite("DESeq")
...
> biocLite("edgeR")
...
> q()
Save workspace image? [y/n/c]: n
|
When you start R later, you can load these modules with just these commands:
login1$ R
> library("DESeq")
> library("edgeR")
|
These commands will work for any Bioconductor modules!
For RNA-seq analysis we're mainly counting the reads that align well, so we choose to use bowtie. (You could also use BWA or many other mappers.)
We've done this several times before, so you should be able to come up with the full command lines if you refer back to the original lesson. Be careful we are now mapping single-end reads, so you may have to look at the bowtie help to figure out how to do that!
You will need to first build the index file, just once and in "interactive mode" is fine. Then, you will need to submit a commands file with four lines to the TACC queue.
Please give the final output files the names: SRR034450.sam, SRR034451.sam, SRR034452.sam, SRR034453.sam.
|
Remember, |
Now create a commands file that looks like this:
Create the launcher script:
Run it:
|
Edit your commands file so that you convert all of these files from SAM to sorted and indexed BAM.
Linux expert tip: you can string together commands all on one line, so that they are sent to the same core one after another by separating them on the line with ;.
Note the use of the variable $FILE, which means that is the only part of the line that we have to change. This is a mini-use of shell scripting.
FILE=SRR034450 && samtools import NC_017544.1.fasta $FILE.sam $FILE.unsorted.bam && samtools sort $FILE.unsorted.bam $FILE && samtools index $FILE.bam FILE=SRR034451 && samtools import NC_017544.1.fasta $FILE.sam $FILE.unsorted.bam && samtools sort $FILE.unsorted.bam $FILE && samtools index $FILE.bam FILE=SRR034452 && samtools import NC_017544.1.fasta $FILE.sam $FILE.unsorted.bam && samtools sort $FILE.unsorted.bam $FILE && samtools index $FILE.bam FILE=SRR034453 && samtools import NC_017544.1.fasta $FILE.sam $FILE.unsorted.bam && samtools sort $FILE.unsorted.bam $FILE && samtools index $FILE.bam |
Re-create the launcher script and submit this new job to the queue.:
module load samtools launcher_creator.py -n samtools -e you@somewhere.com qsub launcher.sge |
bedtools is a great utility for working with sequence features and mapped reads in BAM, BED, VCF, and GFF formats.
We are going to use it to count the number of reads that map to each gene in the genome. Load the module and check out the help for bedtools and the multicov specific command that we are going to use:
module load bedtools bedtools bedtools multicov |
The multicov command takes a feature file (GFF) and counts how many reads in. By default it counts how many reads overlap the feature on either strand, but it can be made specific with an option.
Our GFF file has a lot of redundant features that describe a gene multiple times, so we are going to trim it just to have "gene" features using grep.
grep '^NC_017544[[:space:]]*GenBank[[:space:]]*gene' NC_017544.1.gff > NC_017544.1.genes.gff |
What is this doing? It's taking all the lines that begin with (^), then "NC_017544", then any number of spaces or tabs, then "GenBank", then any number of spaces or tabs, then "gene". Use head to see the before and after.
head -n 50 NC_017544.1.gff head -n 50 NC_017544.1.genes.gff |
In order to use the bedtools command on our data, submit this commands file to the TACC queue:
bedtools multicov -bams SRR034450.bam SRR034451.bam SRR034452.bam SRR034453.bam -bed NC_017544.1.genes.gff > gene_counts.gff head gene_counts.gff |
Our data that is cluttered with a lot of extra columns and one column stuffed with tag\=value information (including the gene names that we want!). Let's clean it up a bit before loading into R - which likes to work on simple tables. GFF are tab-delimited files.
We can do this cleanup many ways, but a quick one is to use the Unix string editor sed. This command replaces the entire beginning of the line up to locus_tag= with nothing (that is, it deletes it). This conveniently leaves us with just the locus_tag and the columns of read counts in each gene. If you were writing a real pipeline, you would probably want to use a Perl script that would check to be sure that each line had the locus_tag (they do), among other things.
head gene_counts.gff sed -e 's/^.*locus_tag=//' < gene_counts.gff > gene_counts.tab head gene_counts.tab |
Now, we are going to load the GFF file straight into R, remove the columns we don't want, name the columns and rows like R expects, and write out this file. You could do this in any other scripting language, or even Excel. We will write out the first few lines of the file at each step, so that you can see what the command is doing.
Do not copy the > characters in the below example. They indicate are the R prompt to remind you which commands you are running the inside of R!
The commands for this example are also described in the DESeq vignette (PDF) .
login1$ R
...
> library("DESeq")
> counts = read.delim("gene_counts.tab", header=F, row.names=1)
> head(counts)
> colnames(counts) = c("wt1", "mut1", "wt2", "mut2")
> head(counts)
> my.design <- data.frame(
row.names = colnames( counts ),
condition = c( "wt", "mut", "wt", "mut"),
libType = c( "single-end", "single-end", "single-end", "single-end" )
)
> conds <- factor(my.design$condition)
> cds <- newCountDataSet( counts, conds )
> cds
> cds <- estimateSizeFactors( cds )
> sizeFactors( cds )
> cds <- estimateDispersions( cds )
> pdf("DESeq-dispersion_estimates.pdf")
> plot(
rowMeans( counts( cds, normalized=TRUE ) ),
fitInfo(cds)$perGeneDispEsts,
pch = '.', log="xy" )
xg <- 10^seq( -.5, 5, length.out=300 )
lines( xg, fitInfo(cds)$dispFun( xg ), col="red" )
)
> dev.off()
> result <- nbinomTest( cds, "wt", "mut" )
> head(result)
> result = result[order(result$pval), ]
> head(result)
> write.csv(result, "DESeq-wt-vs-mut.csv")
> pdf("DESeq-MA-plot.pdf")
> plot(
result$baseMean,
result$log2FoldChange,
log="x", pch=20, cex=.3,
col = ifelse( result$padj < .1, "red", "black" ) )
> dev.off()
> q()
Save workspace image? [y/n/c]: n
login1$ head wt-vs-mut.csv
|
wt-vs-mut.csv is a comma-delimited file that could be reloaded into R or viewed in Excel.
You should copy the two *.pdf files that were created back to your Desktop to view them.
sizeFactors( cds )?
|
They are roughly speaking the relative average coverage of each data set? There are roughly 5 times as many counts of reads in genes for wt2 as there are for mut2. Specifically, they are the size parameter of the negative binomial fit to the counts per gene per data file. |
|
The model assumes there is also a per-gene aspect to the variance in counts observed, that is again fit to a negative binomial distribution (=overdispersed Poisson distribution). The program fits a model where the lower the counts are the more dispersion is expected (red line in graph), and thus the less significant a change in counts becomes. |
edgeR is another R package that you can use to do a similar analysis.
These commands use the negative binomial model, calculate the false discovery rate (FDR ~ adjusted p-value), and make a plot similar to the one from DESeq.
login1$ R
...
> library("edgeR")
> counts = read.delim("gene_counts.tab", header=F, row.names=1)
> colnames(counts) = c("wt1", "mut1", "wt2", "mut2")
> head(counts)
> group <- factor(c("wt","mut","wt","mut"))
> dge = DGEList(counts=counts,group=group)
> dge <- estimateCommonDisp(dge)
> dge <- estimateTagwiseDisp(dge)
> et <- exactTest(dge)
> etp <- topTags(et, n=100000)
> etp$table$logFC = -etp$table$logFC
> pdf("edgeR-MA-plot.pdf")
> plot(
etp$table$logCPM,
etp$table$logFC,
xlim=c(-3, 20), ylim=c(-12, 12), pch=20, cex=.3,
col = ifelse( etp$table$FDR < .1, "red", "black" ) )
> dev.off()
> write.csv(etp$table, "edgeR-wt-vs-mut.csv")
|
Note that the "FC" fold change calculated is initially the reverse of that for the DESeq example for the output here. It is wt relative to mut. To fix this, we put a negative in there for the log fold change.