Analyzing RNA-Seq data for differential gene expression

We will .

Download data files

File Name	Description	Sample
`SRR034450.fastq`	Single-end Illumina 36-bp reads	wild-type, replicate 1
`SRR034451.fastq`	Single-end Illumina 36-bp reads	ΔsigB mutant, replicate 1
`SRR034452.fastq`	Single-end Illumina 36-bp reads	wild-type, replicate 2
`SRR034453.fastq`	Single-end Illumina 36-bp reads	ΔsigB mutant, replicate 2
`NC_017544.1.gbk`	Reference Genome	Listeria monocytogenes strain 10403S

http://www.ebi.ac.uk/ena/data/view/SRP001753

http://www.biomedcentral.com/1471-2164/10/641

Deep RNA sequencing of L. monocytogenes reveals overlapping and extensive stationary phase and sigma B-dependent transcriptomes, including multiple highly transcribed noncoding RNAs

Data files
- RNA-Seq data in FASTQ format (Ex: dataset1.fastq, dataset2.fastq)
Genome sequence files
- Genome sequence in FASTA format (Ex: REL606.fna)
- Genome sequence gene annotations in GFF3 format (Ex: REL606.gff3)

Install software

Bowtie2 (first choice)

module load bowtie

Alternatively:

[http://bowtie-bio.sourceforge.net Download Bowtie]
To build, just type "make" in the source code directory.
Add this directory to your $PATH or move bowtie, and move bowtie-* star executables to your $PATH

Bioconductor modules for R statistics package

Download and install R

login1$ module load R
login1$ R

R version 2.14.0 (2011-10-31)
Copyright (C) 2011 The R Foundation for Statistical Computing
ISBN 3-900051-07-0
Platform: x86_64-unknown-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

  Natural language support but running in an English locale

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

> source("http://bioconductor.org/biocLite.R")
> biocLite("DESeq")
...
> biocLite("edgeR")
...

When you load R subsequent times, you can load these libraries with just these commands:

library("DESeq")
library("edgeR")

library(edgeR)

library(DESEQ)

Commands

Remove adaptor sequences from reads

You will need a FASTA file of adaptor sequences.

[https://wikis.utexas.edu/display/GSAF/Illumina+-+all+flavors Illumina Adapters Sequences (GSAF)] | [%ATTACHURL%/gsaf_illumina_adapters.fasta Download FASTA]

For each input file you will need to run this command (single-end data): %BR%
<code>$far --source datasetX.fastq --target datasetX.noadaptor --adaptive-overlap --trim-end any --adapters gsaf_illumina_adapters.fasta --format fastq-sanger</code>

There is an option to process paired-end data like this: %BR%
<code>$far --source datasetX_R1.fastq --source2 datasetX_R2.fastq --target datasetX.noadaptor --adaptive-overlap --trim-end any --adapters gsaf_illumina_adapters.fasta --format fastq-sanger</code>

---++++ Compile and install FAR on MacOSX

Unfortunately, FAR comes only with Windows and Linux binaries. To build FAR(2.0) for MacOSX:

1 Install Apple Developer Tools
1 Install cmake: $sudo port install cmake
1 Check out code: %BR% <code>$ svn co https://theflexibleadap.svn.sourceforge.net/svnroot/theflexibleadap theflexibleada</code>
1 Compile code: %BR% <code>$ cd theflexibleada</code> %BR% <code>$ cmake CMakeLists.txt </code>
1 Copy executable and library: %BR% <code>$ cp lib/libtbb.dylib ~/local/lib</code> %BR% <code>$ cp build/* ~/local/bin </code>
1 Add these locations to your path with lines in ~/.profile: %BR% <code>export DYLD_LIBRARY_PATH="$DYLD_LIBRARY_PATH:$HOME/local/lib" %BR% export PATH="$PATH:$HOME/local/bin" </code>

---+++ Align reads to reference genome

---++++ Using bowtie2

First, index your genome so bowtie2 can map read to it: %BR%
<code>$bowtie2-build REL606.fna REL606</code> %BR%

Then, align each data set: %BR%
<code>$bowtie2 -x REL606 -U datasetX.fastq --phred33 -S REL606.sam</code> %BR%

Optionally, add the <code>--local</code> flag if your reads do not map end-to-end.

---++++ Using BWA
First, index your genome so BWA can map read to it: %BR%
<code>$bwa index REL606.fna</code> %BR%

Then, align each data set: %BR%
<code>$bwa aln REL606.fna dataset1.fastq > datasetX.sai </code> %BR%

And convert to SAM format (assumes single-end data):
<code>$bwa samse REL606.fna datasetX.sai datasetX.fastq > datasetX.sam </code> %BR%

---++ Count reads mapping to genes

<code>breseq RNASEQ -f REL606.fna -r REL606.gbk -o datasetX.count.tab datasetX.sam</code> %BR%

---++ Convert alignments to BAM

And convert to BAM format (assumes single-end data): %BR%
<code>$samtools faidx REL606.fna </code> %BR%
<code>$samtools import REL606.fna datasetX.sam datasetX.unsorted.bam </code> %BR%
<code>$samtools sort datasetX.unsorted.bam datasetX </code> %BR%
<code>$samtools index datasetX.bam </code> %BR%

Now you can use IGV to view them.

hg2. Analyze differential gene expression

Use bedtools to count reads in features.
Converting mapped reads to feature counts.

The data files for this example are in the path:

/corral-repl/utexas/BioITeam/ngs_course/ecoli_rnaseq

File Name	Description
`SRR030252.fastq.gz`	Illumina reads, 0K generation individual clone from population
`SRR032374.fastq.gz`	Illumina reads, 20K generation mixed population
`SRR032376.fastq.gz`	Illumina reads, 40K generation mixed population
`NC_012967.1.fasta.gz`	E. coli B str. REL606 genome

hg3. Using DESeq

[http://bioconductor.org/packages/devel/bioc/html/DESeq.html Manual and Instructions]

login1$ R
...
>source("http://bioconductor.org/biocLite.R")
>biocLite("DESeq")

login1$ R
...
> library("DESeq")
> combined = read.csv("combined.csv", header=T, row.names=1)
> design <- data.frame(
  row.names = colnames( combined ),
  condition = c( "Anc", "Anc", "EL", "EL", "EW", "EW"),
  libType = c( "single-end", "single-end", "single-end",
  "single-end", "single-end", "single-end" ) )
> design
> conds <- factor(design$condition)
> cds <- newCountDataSet( combined, conds )
> cds <- estimateSizeFactors( cds )
> sizeFactors( cds )
> res <- nbinomTest( cds, "EL", "EW" )
> write.csv(res, "EL-vs-EW.csv")