Content Comparison

In this exercise, we will analyze RNA-seq data to measure changes in gene expression levels between wild-type and mutant strains of the bacterium Listeria monocytogenes.

Table of Contents

Download data files

Copy the data files for this example into your $SCRATCH space:

...

Expand

	Just give me the answer...
	Just give me the answer...

Code Block
module load bowtie2 bowtie2-build NC_017544.1.fasta NC_017544.1

Now create a commands file that looks like this:

Code Block

bowtie2 -p 3 -x NC_017544.1 -U SRR034450.fastq -S SRR034450.sam
bowtie2 -p 3 -x NC_017544.1 -U SRR034451.fastq -S SRR034451.sam
bowtie2 -p 3 -x NC_017544.1 -U SRR034452.fastq -S SRR034452.sam
bowtie2 -p 3 -x NC_017544.1 -U SRR034453.fastq -S SRR034453.sam

Create the launcher script and run it:

Code Block
module load python launcher_creator.py -n bowtie -e you@somewhere.com -t 0:30:00 qsub launcher.sge h2.

Convert

...

alignments

...

to

...

BAM

...

Edit

...

your

...

commands

...

file

...

so

...

that

...

you

...

convert

...

all

...

of

...

these

...

files

...

from

...

SAM

...

to

...

sorted

...

and

...

indexed

...

BAM.

...

Linux

...

expert

...

tip:

...

you

...

can

...

string

...

together

...

commands

...

all

...

on

...

one

...

line,

...

so

...

that

...

they

...

are

...

sent

...

to

...

the

...

same

...

core

...

one

...

after

...

another

...

by

...

separating

...

them

...

on

...

the

...

line

...

with

...

&&

...

.

...

Note

...

the

...

use

...

of

...

the

...

variable

...

$FILE,

...

which

...

means

...

that

...

is

...

the

...

only

...

part

...

of

...

the

...

line

...

that

...

we

...

have

...

to

...

change.

...

This

...

is

...

a

...

mini-use

...

of

...

shell

...

scripting.

Code Block

FILE=SRR034450 && samtools import NC_017544.1.fasta $FILE.sam $FILE.unsorted.bam && samtools sort $FILE.unsorted.bam $FILE && samtools index $FILE.bam


FILE=SRR034451 && samtools import NC_017544.1.fasta $FILE.sam $FILE.unsorted.bam && samtools sort $FILE.unsorted.bam $FILE && samtools index $FILE.bam


FILE=SRR034452 && samtools import NC_017544.1.fasta $FILE.sam $FILE.unsorted.bam && samtools sort $FILE.unsorted.bam $FILE && samtools index $FILE.bam


FILE=SRR034453 && samtools import NC_017544.1.fasta $FILE.sam $FILE.unsorted.bam && samtools sort $FILE.unsorted.bam $FILE && samtools index $FILE.bam

code

Re-create

...

the

...

launcher

...

script

...

and

...

submit

...

this

...

new

...

job

...

to

...

the

...

queue.

...

Be

...

sure

...

you

...

have

...

samtools

...

loaded

...

as

...

the

...

node

...

that

...

your

...

job

...

launches

...

on

...

will

...

inherit

...

your

...

current

...

environment,

...

including

...

whatever

...

modules

...

you

...

have

...

loaded:

h4.

Code Block

module load samtools


launcher_creator.py -n samtools -e you@somewhere.com


qsub launcher.sge

Code Block

Optional

...

Exercise

...

Is
...
this
...
a
...
strand-specific
...
RNA-seq
...
library?
...
Try
...
using
...
IGV
...
to
...
view
...
some
...
of
...
the
...
BAM
...
file
...
data
...
and
...
examine
...
the
...
reads
...
mapped
...
to
...
each
...
gene.
...

Count

...

reads

...

mapping

...

to

...

genes

...

using

...

bedtools

...

module load bedtools
bedtools
bedtools multicov

Code Block



The multicov command takes a feature file (GFF) and counts how many reads are in certain regions from many input files. By default it counts how many reads overlap the feature on either strand, but it can be made specific with an option.

Our GFF file has a lot of redundant features that describe a gene multiple times, so we are going to trim it just to have "gene" features using {{grep}}.

...

bedtools is a great utility for working with sequence features and mapped reads in BAM, BED, VCF, and GFF formats.

We are going to use it to count the number of reads that map to each gene in the genome. Load the module and check out the help for bedtools and the multicov specific command that we are going to use:

Code Block
module load bedtools bedtools bedtools multicov

The multicov command takes a feature file (GFF) and counts how many reads are in certain regions from many input files. By default it counts how many reads overlap the feature on either strand, but it can be made specific with an option.

Our GFF file has a lot of redundant features that describe a gene multiple times, so we are going to trim it just to have "gene" features using grep.

Code Block


grep '^NC_017544[[:space:]]*GenBank[[:space:]]*gene' NC_017544.1.gff > NC_017544.1.genes.gff

code

What

...

is

...

this

...

doing?

...

It's

...

taking

...

all

...

the

...

lines

...

that

...

begin

...

with

...

(^),

...

then

...

"NC_017544",

...

then

...

any

...

number

...

of

...

spaces

...

or

...

tabs,

...

then

...

"GenBank",

...

then

...

any

...

number

...

of

...

spaces

...

or

...

tabs,

...

then

...

"gene".

...

Use

...

head

...

to

...

see

...

the

...

before

...

and

...

after.

Code Block


head

head

-n 50 NC_017544.1.gff


head -n 50 NC_017544.1.genes.gff

code

In

...

order

...

to

...

use

...

the

...

bedtools

...

command

...

on

...

our

...

data,

...

submit

...

this

...

commands

...

file

...

to

...

the

...

TACC

...

queue:

h2.

Code Block

bedtools multicov -bams SRR034450.bam SRR034451.bam SRR034452.bam SRR034453.bam -bed NC_017544.1.genes.gff > gene_counts.gff


head gene_counts.gff

Code Block

Analyze

...

differential

...

gene

...

expression

...

(DESeq)

...

DESeq Manual and Instructions

Our data that is cluttered with a lot of extra columns and one column stuffed with tag\=value information (including the gene names that we want!). Let's clean it up a bit before loading into R - which likes to work on simple tables. GFF are tab-delimited files.

We can do this cleanup many ways, but a quick one is to use the Unix string editor sed. This command replaces the entire beginning of the line up to locus_tag= with nothing (that is, it deletes it). This conveniently leaves us with just the locus_tag and the columns of read counts in each gene. If you were writing a real pipeline, you would probably want to use a Perl or Python script that would check to be sure that each line had the locus_tag (they do), among other things.

Code Block

title	Reformatting gene_counts.gff


head gene_counts.gff
sed 's/^.*locus_tag=//' gene_counts.gff > gene_counts.tab

After it has run, take a peek at the new file:

Code Block
head gene_counts.tab

Warning
Do not copy the `>` characters in the examples. They are the R prompt to remind you which commands you are running the inside of R!*

The commands for this example are also described in the DESeq vignette (PDF) .

Code Block

title	Using DESeq


login1$ R
...
> library("DESeq")
> counts = read.delim("gene_counts.tab", header=F, row.names=1)
> head(counts)
> colnames(counts) = c("wt1", "mut1", "wt2", "mut2")
> head(counts)
> my.design <- data.frame(
  row.names = colnames( counts ),
  condition = c( "wt", "mut", "wt", "mut"),
  libType = c( "single-end", "single-end", "single-end", "single-end" ) 
)
> conds <- factor(my.design$condition)

> cds <- newCountDataSet( counts, conds )
> cds

> cds <- estimateSizeFactors( cds )
> sizeFactors( cds )

> cds <- estimateDispersions( cds )

> pdf("DESeq-dispersion_estimates.pdf")
> plot(
  rowMeans( counts( cds, normalized=TRUE ) ),
  fitInfo(cds)$perGeneDispEsts,
  pch = '.', log="xy" )
  xg <- 10^seq( -.5, 5, length.out=300 )
  lines( xg, fitInfo(cds)$dispFun( xg ), col="red" )
)
> dev.off()

> result <- nbinomTest( cds, "wt", "mut" )
> head(result)

> result = result[order(result$pval), ]
> head(result)

> write.csv(result, "DESeq-wt-vs-mut.csv")

> pdf("DESeq-MA-plot.pdf")
> plot(
  result$baseMean,
  result$log2FoldChange,
  log="x", pch=20, cex=.3,
  col = ifelse( result$padj < .1, "red", "black" ) )
> dev.off()

> q()
Save workspace image? [y/n/c]: n
login1$ head wt-vs-mut.csv

wt-vs-mut.csv is a comma-delimited file that could be reloaded into R or viewed in Excel.

You should copy the two *.pdf files that were created back to your Desktop to view them.

Exercises

What are the numbers returned by sizeFactors( cds )?

Expand

	Answer...
	Answer...

They are, roughly speaking, the relative average coverage of each data set. Specifically, they are the size parameter of the negative binomial fit to the counts per gene per data file.

What are the dispersion estimates?

Expand

	Answer...
	Answer...

The model assumes there is also a per-gene aspect to the variance in counts observed, that is again fit to a negative binomial distribution (=overdispersed Poisson distribution). In this model, the lower the counts are, the more dispersion relative to the mean is expected (red line in graph). Thus, higher fold changes are required in lowly expressed genes to call the same observed fold-change difference as significant.

What was the predominant effect of the mutation on gene expression in this Listeria strain?

...

Version	Old Version 38	New Version 39
Changes made by	Jeffrey E Barrick	Jeffrey E Barrick
Saved on	Aug 21, 2012	Aug 21, 2012

Versions Compared

Key

Download data files

Convert

alignments

to

BAM

Optional

Exercise

Is
...
this
...
a
...
strand-specific
...
RNA-seq
...
library?
...
Try
...
using
...
IGV
...
to
...
view
...
some
...
of
...
the
...
BAM
...
file
...
data
...
and
...
examine
...
the
...
reads
...
mapped
...
to
...
each
...
gene.
...

Count

reads

mapping

to

genes

using

bedtools

Analyze

differential

gene

expression

(DESeq)

Exercises

Content Comparison

Versions Compared

Key

Download data files

Convert

alignments

to

BAM

Optional

Exercise

Is ... this ... a ... strand-specific ... RNA-seq ... library? ... Try ... using ... IGV ... to ... view ... some ... of ... the ... BAM ... file ... data ... and ... examine ... the ... reads ... mapped ... to ... each ... gene....

Count

reads

mapping

to

genes

using

bedtools

Analyze

differential

gene

expression

(DESeq)

Exercises

Is
...
this
...
a
...
strand-specific
...
RNA-seq
...
library?
...
Try
...
using
...
IGV
...
to
...
view
...
some
...
of
...
the
...
BAM
...
file
...
data
...
and
...
examine
...
the
...
reads
...
mapped
...
to
...
each
...
gene.
...