View Source

In this exercise, we will analyze RNA-seq data to measure changes in gene expression levels between wild-type and mutant strains of Listeria.

Download data files

Copy the data files for this example into your $SCRATCH space:

cds
cp -r /corral-repl/utexas/BioITeam/ngs_course/listeria_RNA_seq/data listeria_RNA_seq

File Name	Description	Sample
`SRR034450.fastq`	Single-end Illumina 36-bp reads	wild-type, biological replicate 1
`SRR034451.fastq`	Single-end Illumina 36-bp reads	ΔsigB mutant, biological replicate 1
`SRR034452.fastq`	Single-end Illumina 36-bp reads	wild-type, biological replicate 2
`SRR034453.fastq`	Single-end Illumina 36-bp reads	ΔsigB mutant, biological replicate 2
`NC_017544.1.gbk`	Reference Genome	Listeria monocytogenes strain 10403S

This data was submitted in the Short Read Archive to accompany this paper:

Oliver, H.F., et al. (2009) Deep RNA sequencing of L. monocytogenes reveals overlapping and extensive stationary phase and sigma B-dependent transcriptomes, including multiple highly transcribed noncoding RNAs. BMC Genomics 10:641. Pubmed

You can view the data in the ENA SRA here: http://www.ebi.ac.uk/ena/data/view/SRP001753

Installing Bioconductor modules for R

Many of the modules for doing statistical tests on NGS data have been written in the "R" language for statistical computing. If you're not familiar with R, then this section is likely to be a bit confusing. (You might be thinking "Stop with the new languages already guys! Uncle!") To orient you, we are going to run the R command, which launches the R shell inside our terminal. Like the bash shell that we were using, the R shell interprets commands, but now they are R commands rather than bash commands. The prompt changes from login1$ to > when you are in the R shell, to help clue you in to this fact.

R is the favorite language of pirates. R is a very common scripting language used in statistics. There are whole classes on it going on in other SSI classrooms as we speak! Inside the R universe, you have access to an incredibly large number of useful statistical functions (Fisher's exact test, nonlinear least-squares fitting, ANOVA ...). R also has advanced functionality for producing plots and graphs as output. We'll take advantage of all of this here. You are well on your way to becoming denizens of the polyglot bioinformatics community now.

Regrettably, R is a bit of it's own bizarro world, as far as how its commands work. (Futhermore, Googling "R" to get help can be very frustrating.) The conventions of most other programming and scripting languages seem to have been re-invented by someone who wanted to do everything their own way in R. Just like we wrote shell scripts in bash, you can write R scripts that carry out complicated analyses.

Basic rules of R:

Don't forget: it's q() to quit.
For help, type ?command. Try ?read.table
The left arrow \<- (less-than-dash) is the same as an equals sign \=.

Like other languages, R can be expanded by loading modules. The R equivalent of Bioperl or Biopython is Bioconductor. Bioconductor can do things for you like convert sequences, but where it really shines is in doing statistical tests (where is it second-to-none in this list of languages). Many functions for analyzing microarray data are implemented in R, and this strength has now carried over to the analysis of RNAseq data.

Here's how you install two modules that we will need for this exercise:

login1$ module load R
login1$ R

R version 2.14.0 (2011-10-31)
Copyright (C) 2011 The R Foundation for Statistical Computing
ISBN 3-900051-07-0
Platform: x86_64-unknown-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

  Natural language support but running in an English locale

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

> source("http://bioconductor.org/biocLite.R")
...
> biocLite("DESeq")
...
> biocLite("edgeR")
...
> q()
Save workspace image? [y/n/c]: n

When you start R later, you can load these modules with just these commands:

login1$ R
> library("DESeq")
> library("edgeR")

These commands will work for any Bioconductor modules!

Align the reads

For RNA-seq analysis we're mainly counting the reads that align well, so we choose to use bowtie. (You could also use BWA or many other mappers).

We've done this several times before, so you should be able to come up with the full command lines if you refer back to the original lesson. You will need to first build the index file, just once and in "interactive mode" is fine. Then, you will need to submit a commands file with four lines to the TACC queue.

Please give the final output files the names: SRR034450.sam, SRR034451.sam, SRR034452.sam, SRR034453.sam.

Remember, bowtie-build once then bowtie for each separate sample.

module load bowtie
bowtie-build

Now create a commands file that looks like this:

---+++ Align reads to reference genome

---++++ Using bowtie2

First, index your genome so bowtie2 can map read to it: %BR%
<code>$bowtie2-build REL606.fna REL606</code> %BR%

Then, align each data set: %BR%
<code>$bowtie2 -x REL606 -U datasetX.fastq --phred33 -S REL606.sam</code> %BR%

Optionally, add the <code>--local</code> flag if your reads do not map end-to-end.

---++++ Using BWA
First, index your genome so BWA can map read to it: %BR%
<code>$bwa index REL606.fna</code> %BR%

Then, align each data set: %BR%
<code>$bwa aln REL606.fna dataset1.fastq > datasetX.sai </code> %BR%

And convert to SAM format (assumes single-end data):
<code>$bwa samse REL606.fna datasetX.sai datasetX.fastq > datasetX.sam </code> %BR%

---++ Count reads mapping to genes

<code>breseq RNASEQ -f REL606.fna -r REL606.gbk -o datasetX.count.tab datasetX.sam</code> %BR%

---++ Convert alignments to BAM

And convert to BAM format (assumes single-end data):

login1$ samtools faidx REL606.fna
login1$ samtools import REL606.fna datasetX.sam datasetX.unsorted.bam
login1$ samtools sort datasetX.unsorted.bam datasetX
login1$ samtools index datasetX.bam

Exercise: Is this a strand-specific RNA-seq library? Try using IGV to view some of the data.

hg2. Analyze differential gene expression

Use bedtools to count reads in features.
Converting mapped reads to feature counts.

hg3. Using DESeq

[http://bioconductor.org/packages/devel/bioc/html/DESeq.html Manual and Instructions]

login1$ R
...
>source("http://bioconductor.org/biocLite.R")
>biocLite("DESeq")

login1$ R
...
> library("DESeq")
> combined = read.csv("combined.csv", header=T, row.names=1)
> design <- data.frame(
  row.names = colnames( combined ),
  condition = c( "Anc", "Anc", "EL", "EL", "EW", "EW"),
  libType = c( "single-end", "single-end", "single-end",
  "single-end", "single-end", "single-end" ) )
> design
> conds <- factor(design$condition)
> cds <- newCountDataSet( combined, conds )
> cds <- estimateSizeFactors( cds )
> sizeFactors( cds )
> res <- nbinomTest( cds, "EL", "EW" )
> write.csv(res, "EL-vs-EW.csv")

Additional Exercises

In an actual RNAseq analysis, you might want to trim stray adaptor sequences from your data using a tool like the FASTX-Toolkit or FAR before aligning.