Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.
Comment: Migrated to Confluence 5.3

...

File Name

Description

Sample

SRR034450.fastq

Single-end Illumina 36-bp reads

wild-type, biological replicate 1

SRR034451.fastq

Single-end Illumina 36-bp reads

Δ ?sigB mutant, biological replicate 1

SRR034452.fastq

Single-end Illumina 36-bp reads

wild-type, biological replicate 2

SRR034453.fastq

Single-end Illumina 36-bp reads

Δ ?sigB mutant, biological replicate 2

NC_017544.1.fasta

Reference Genome sequence (FASTA)

Listeria monocytogenes strain 10403S

NC_017544.1.gff

Reference Genome features (GFF)

Listeria monocytogenes strain 10403S

...

Code Block
cds
cp -r $BI/ngs_course/listeria_RNA_seq/mapped_data listeria_RNA_seq

Then, skip over the mapping and SAM/BAM conversion, sorting, indexing steps #Create BAM file of mapped reads section below.

Using the R environment for statistical computing

...

Warning
titleDo not copy the > characters in the R examples.

They are the R prompt to remind you which commands are to be run inside the R shell!

...

Hints for working with R

  • Don't forget: it's q() to quit.
  • For help, type ?command. Try ?read.table. The q key gets you out of help, just like for a man page.
  • The left arrow <- (less-than-dash) is the same as an equals sign =. You can use them interchangeably.
  • The prompt we will sometimes be showing for R is >. Don't type this for a command. It is like the login1$ at the beginning of the bash prompt when you log in to Lonestar. It just means that you are in the R shell.
  • You can type the name of a variable to have its value displayed. Like this...
    Code Block
    > x <- 10 + 5 + 6
    > x
    [1] 21
    

...

Warning

The install commands may take several minutes to complete. You can read ahead while they run or even open a new terminal window and connect it to Lonestar and continue onward in the tutorial as you wait for R.

Code Block
titleStarting R and loading the modules for this tutorial
login1$ module load R
login1$ R

R version 2.1415.03 (20112013-1003-3101) -- "Security Blanket"
Copyright (C) 20112013 The R Foundation for Statistical Computing
ISBN 3-900051-07-0
Platform: x86_64-unknown-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

  Natural language support but running in an English locale

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

> source("http://bioconductor.org/biocLite.R")
...Warning >in biocLiteinstall.packages("DESeqBiocInstaller")
...
> biocLite("edgeR")
...
> q()
Save workspace image? [y/n/c]: n

When you start R later, you will not need to re-intall the modules. You can load them with just these commands:

Code Block

login1$ R
> library("DESeq")
> library("edgeR")

These commands will work for any Bioconductor module!

Create BAM file of mapped reads

Map reads using Bowtie

For RNA-seq analysis we're mainly counting the reads that align well, so we choose to use bowtie. (You could also use BWA or many other mappers.)

We've done this several times before, so you should be able to come up with the full command lines if you refer back to the original lesson.

Warning

Be careful we are now mapping single-end reads, so you may have to look at the bowtie help to figure out how to do that!

You will need to first build the index file, just once and in "interactive mode" is fine (it's fast, so you don't need an idev shell). Then, you will need to submit a commands file with four lines to the TACC queue.

Please give the final output files the names: SRR034450.sam, SRR034451.sam, SRR034452.sam, SRR034453.sam.

...

Remember, bowtie-build once then bowtie for each separate sample.

...

Code Block

module load bowtie
bowtie-build NC_017544.1.fasta NC_017544.1

Now create a commands file that looks like this:

Code Block

bowtie -p 3 -S NC_017544.1 SRR034450.fastq -S SRR034450.sam
bowtie -p 3 -S NC_017544.1 SRR034451.fastq -S SRR034451.sam
bowtie -p 3 -S NC_017544.1 SRR034452.fastq -S SRR034452.sam
bowtie -p 3 -S NC_017544.1 SRR034453.fastq -S SRR034453.sam

Create the launcher script and run it:

Code Block

module load python
launcher_creator.py -n bowtie -q development -c commands -t 0:30:00
qsub launcher.sge

Convert alignments to BAM

Edit your commands file so that you convert all of these files from SAM to sorted and indexed BAM.

Linux expert tip: you can string together commands all on one line, so that they are sent to the same core one after another by separating them on the line with &&.

Note the use of the variable $FILE, which means that is the only part of the line that we have to change. This is a mini-use of shell scripting.

Code Block

FILE=SRR034450 && samtools import NC_017544.1.fasta $FILE.sam $FILE.unsorted.bam && samtools sort $FILE.unsorted.bam $FILE && samtools index $FILE.bam
FILE=SRR034451 && samtools import NC_017544.1.fasta $FILE.sam $FILE.unsorted.bam && samtools sort $FILE.unsorted.bam $FILE && samtools index $FILE.bam
FILE=SRR034452, repos = a["BioCsoft", "URL"]) :
  'lib = "/opt/apps/R/2.15.3/lib64/R/library"' is not writable
Would you like to use a personal library instead?  (y/n) y
Would you like to create a personal library
~/R/x86_64-unknown-linux-gnu-library/2.15
to install packages into?  (y/n) y
...
> biocLite("DESeq")
...
> biocLite("edgeR")
...
> q()
Save workspace image? [y/n/c]: n

When you start R later, you will not need to re-install the modules. You can load them with just these commands:

Code Block
titleStarting R and loading modules after they are installed

login1$ R
> library("DESeq")
> library("edgeR")

These commands will work for any Bioconductor module!

Create BAM file of mapped reads

Anchor
Create BAM file of mapped reads
Create BAM file of mapped reads

Map reads using Bowtie

For RNA-seq analysis we're mainly counting the reads that align well, so we choose to use bowtie. (You could also use BWA or many other mappers.)

We've done this several times before, so you should be able to come up with the full command lines if you refer back to the original lesson.

Warning

Be careful we are now mapping single-end reads, so you may have to look at the bowtie help to figure out how to do that!

You will need to first build the index file, just once and in "interactive mode" is fine (it's fast, so you don't need an idev shell). Then, you will need to submit a commands file with four lines to the TACC queue using qsub.

Please give the final output files the names: SRR034450.sam, SRR034451.sam, SRR034452.sam, SRR034453.sam.

Expand
I just want a little hint
I just want a little hint

Remember, bowtie-build once then bowtie for each separate sample.

Expand
Please take me through all of the steps...
Please take me through all of the steps...
Code Block

module load bowtie
bowtie-build NC_017544.1.fasta NC_017544.1

Now create a bowtie_commands file that looks like this using nano or another text editor:

Code Block

bowtie -p 3 -S NC_017544.1 SRR034450.fastq SRR034450.sam
bowtie -p 3 -S NC_017544.1 SRR034451.fastq SRR034451.sam
bowtie -p 3 -S NC_017544.1 SRR034452.fastq SRR034452.sam
bowtie -p 3 -S NC_017544.1 SRR034453.fastq SRR034453.sam

Remember that there are 12 processors per node on Lonestar, so we choose to use 3 for each of the 4 jobs with the -p 3 option.

Create the launcher script and run it:

Warning

Remember that you cannot qsub from within an idev shell!

Code Block

module load python
launcher_creator.py -n bowtie -q development -j bowtie_commands -t 0:30:00
qsub bowtie.sge

Convert Bowtie output to BAM

Create a new samtools_commands file so that you convert all of these files from SAM to sorted and indexed BAM all at one time by using qsub.

Linux expert tip: you can string together commands all on one line, so that they are sent to the same core one after another by separating them on the line with &&.

Note the use of the variable $FILE, by which we set a variable's value int he first part of the line and use it over and over in the latter part of the line. This is a mini use of shell scripting.

Code Block
titleContents of samtools_commands file

FILE=SRR034450 && samtools import NC_017544.1.fasta $FILE.sam $FILE.unsorted.bam && samtools sort $FILE.unsorted.bam $FILE && samtools index $FILE.bam
FILE=SRR034453SRR034451 && samtools import NC_017544.1.fasta $FILE.sam $FILE.unsorted.bam && samtools sort $FILE.unsorted.bam $FILE && samtools index $FILE.bam

Re-create the launcher script and submit this new job to the queue. Be sure you have samtools loaded as the node that your job launches on will inherit your current environment, including whatever modules you have loaded:

Code Block

module load samtools
launcher_creator.py -n samtools -e you@somewhere.com
qsub launcher.sge

Optional Exercise

  • Is this a strand-specific RNA-seq library? Try using IGV to view some of the BAM file data and examine the reads mapped to each gene.

Count reads mapping to genes

bedtools

bedtools is a great utility for working with sequence features and mapped reads in BAM, BED, VCF, and GFF formats.

We are going to use it to count the number of reads that map to each gene in the genome. Load the module and check out the help for bedtools and the multicov specific command that we are going to use:

Code Block

module load bedtools
bedtools
bedtools multicov

The multicov command takes a feature file (GFF) and counts how many reads are in certain regions from many input files. By default it counts how many reads overlap the feature on either strand, but it can be made specific with the -s option. Note: Remember that the chromosome names in your gff file should match the way the chromosomes are named in the reference fasta file used in the mapping step. For example, if gff file contains chr1, chrX etc, the GFF file must also call the chromosomes as chr1, chrX and so on.

Our GFF file has a lot of redundant features that describe a gene multiple times, so we are going to trim it just to have "gene" features using grep.

Code Block

grep '^NC_017544[[:space:]]*GenBank[[:space:]]*gene' NC_017544.1.gff > NC_017544.1.genes.gff

What is this doing? It's taking all the lines that begin with (^), then "NC_017544", then any number of spaces or tabs, then "GenBank", then any number of spaces or tabs, then "gene". Use head to see the before and after.

Code Block

head -n 50 NC_017544.1.gff
head -n 50FILE=SRR034452 && samtools import NC_017544.1.fasta $FILE.sam $FILE.unsorted.bam && samtools sort $FILE.unsorted.bam $FILE && samtools index $FILE.bam
FILE=SRR034453 && samtools import NC_017544.1.fasta $FILE.sam $FILE.unsorted.bam && samtools sort $FILE.unsorted.bam $FILE && samtools index $FILE.bam

Create a new launcher script and submit this new job to the queue. Be sure you have samtools loaded as the node that your job launches on will inherit your current environment, including whatever modules you have loaded:

Code Block

module load samtools
Expand
I'd like to see the commands for this qsub...
I'd like to see the commands for this qsub...
Code Block

launcher_creator.py -n samtools -q development -j samtools_commands -t 0:30:00
qsub samtools.sge

Optional Exercise

  • Is this a strand-specific RNA-seq library? Try using IGV to view some of the BAM file data and examine the reads mapped to each gene.

Count reads mapping to genes

bedtools

bedtools is a great utility for working with sequence features and mapped reads in BAM, BED, VCF, and GFF formats.

We are going to use it to count the number of reads that map to each gene in the genome. Load the module and check out the help for bedtools and the multicov specific command that we are going to use:

Code Block

module load bedtools
bedtools
bedtools multicov

The multicov command takes a feature file (GFF) and counts how many reads are in certain regions from many input files. By default it counts how many reads overlap the feature on either strand, but it can be made specific with the -s option. Unfortunately, this option only exists for the multicov command in a version of bedtools that is newer than the module on TACC, so we don't include it in the example command below.

Note: Remember that the chromosome names in your gff file should match the way the chromosomes are named in the reference fasta file used in the mapping step. For example, if BAM file used for mapping contains chr1, chrX etc, the GFF file must also call the chromosomes as chr1, chrX and so on.

Our GFF file has a lot of redundant features that describe a gene multiple times, so we are going to trim it just to have "gene" features using grep.

Code Block
titleThis is all one line...

grep '^NC_017544[[:space:]]*GenBank[[:space:]]*gene' NC_017544.1.gff > NC_017544.1.genes.gff

In order to use the bedtools command on our data, submit this commands file to the TACC queue:What is this doing? It's taking all the lines that begin with (^), then "NC_017544", then any number of spaces or tabs, then "GenBank", then any number of spaces or tabs, then "gene". Use head to see the before and after.

Code Block
bedtoolshead multicov -sn -bams SRR034450.bam SRR034451.bam SRR034452.bam SRR034453.bam -bed50 NC_017544.1.gff
head -n 50 NC_017544.1.genes.gff > gene_counts.gff
head gene_counts.gff

Optional: HTseq

HTseq is another tool to count reads. bedtools has many many useful functions, and counting reads is just one of them. In contrast, HTseq is a specialized utility for counting reads, and it does not have many functions other than that. HTseq is very slow and you need to run multiple command lines in order to do the same job as what bedtools multicov did. Why do we learn this? Well, you may want to care about reads mapped on intersection when you count reads. Please take a look at this page, and if this sophisticated counting method looks useful for you, use HTseq. Otherwise, use bedtools.

Code Block

grep "^NC_017544" NC_017544.1.gff > count_ref

In order to use the bedtools command on our data, do the following:

Warning
Submit to the TACC queue or run in an idev shell
Submit to the TACC queue or run in an idev shell
Code Block

bedtools multicov -bams SRR034450.bam SRR034451.bam SRR034452.bam SRR034453.bam -bed NC_017544.1.genes.gff > gene_counts.gff

Then take a peek at the data...

Code Block

head gene_counts.gff

Optional: HTseq

HTseq is another tool to count reads. bedtools has many many useful functions, and counting reads is just one of them. In contrast, HTseq is a specialized utility for counting reads, and it does not have many functions other than that. HTseq is very slow and you need to run multiple command lines in order to do the same job as what bedtools multicov did. Why do we learn this? Well, you may want to care about reads mapped on intersection when you count reads. Please take a look at this page, and if this sophisticated counting method looks useful for you, use HTseq. Otherwise, use bedtools.

Code Block

grep "^NC_017544" NC_017544.1.gff > count_ref.gff
samtools view SRR034450.bam | htseq-count -m intersection-nonempty -t gene -i ID - count_ref.gff > count1.gff
samtools view SRR034451.bam | htseq-count -m intersection-nonempty -t gene -i ID - count_ref.gff > count2.gff
samtools view SRR034452.bam | htseq-count -m intersection-nonempty -t gene -i ID - count_ref.gff > count3.gff
samtools view SRR034450SRR034453.bam | htseq-count -m intersection-nonempty -t gene -i ID - count_ref.gff > count4.gff

join count1.gff count2.gff | join - count3.gff | join - count_refcount4.gff > count1gene_counts_HTseq.gff
samtools#if you viewhave SRR034451.bam | htseq-count -m intersection-nonempty -t gene -i ID - count_ref.gff > count2.gff
samtools view SRR034452.bam | htseq-count -m intersection-nonempty -t gene -i ID - count_ref.gff > count3.gff
samtools view SRR034453.bam | htseq-count -m intersection-nonempty -t gene -i ID - count_ref.gff > count4.gff
many samples, use for-loop and join

gene_counts_HTseq.gff has 5 more lines than gene_counts.gff. Check out the last 5 lines. They are basic statistics.

Code Block

wc -l gene_counts_HTseq.gff
tail gene_counts_HTseq.gff

The basic statistics (last 5 lines) is useful to know, but should be removed to use it as a input file for DEGseq

Code Block

head -2910 gene_counts_HTseq.gff > gene_counts_HTseq.tab

Finally, gene_counts_HTseq.tab is ready to use. HTseq-count is strand-specific in default. Therefore, read counts for each gene in gene_counts_HTseq.gff are approximately a half counts in gene_counts.gff for the corresponding gene.

Analyze differential gene expression

...

You should copy the two *.pdf files that were created back to your local computer to view them .using scp

Exercises

  • What are the numbers returned by sizeFactors( cds )?
    Expand
    Answer...
    Answer...

    They are, roughly speaking, the relative average coverage of each data set. Specifically, they are the size parameter of the negative binomial fit to the counts per gene per data file.

  • What are the dispersion estimates?
    Expand
    Answer...
    Answer...

    The model assumes there is also a per-gene aspect to the variance in counts observed, that is again fit to a negative binomial distribution (=overdispersed Poisson distribution). In this model, the lower the counts are, the more dispersion relative to the mean is expected (red line in graph). Thus, higher fold changes are required in lowly expressed genes to call the same observed fold-change difference as significant.

  • What was the predominant effect of the mutation on gene expression in this Listeria strain?

...

Code Block
titleUsing edgeR
login1$ R
...
> library("edgeR")
> counts = read.delim("gene_counts.tab", header=F, row.names=1)
> colnames(counts) = c("wt1", "mut1", "wt2", "mut2")
> head(counts)
> group <- factor(c("wt","mut","wt","mut"))
> dge = DGEList(counts=counts,group=group)
> dge <- estimateCommonDisp(dge)
> dge <- estimateTagwiseDisp(dge)
> et <- exactTest(dge)
> etp <- topTags(et, n=100000)
> etp$table$logFC = -etp$table$logFC

> pdf("edgeR-MA-plot.pdf")
> plot(
  etp$table$logCPM,
  etp$table$logFC,
  xlim=c(-3, 20), ylim=c(-12, 12), pch=20, cex=.3,
  col = ifelse( etp$table$FDR < .1, "red", "black" ) )
> dev.off()

> write.csv(etp$table, ".off()

> write.csv(etp$table, "edgeR-wt-vs-mut.csv")
> q()
Save workspace image? [y/n/c]: n
login1$ head edgeR-wt-vs-mut.csv")

Note that the "FC" fold change calculated is initially the reverse of that for the DESeq example for the output here. It is wt relative to mut. To fix this, we put a negative in there for the log fold change.

...

Comparison

  • Compare the expression changes predicted by DESeq and edgeR to each other.
  • Does edgeR or DESeq predict more significant changes?

...

  • In an actual RNAseq analysis, you might want to trim stray adaptor sequences from your data using a tool like the FASTX-Toolkit, FAR, or cutadapt before aligningthe tools discussed in Evaluating your raw sequencing data.
  • You can get a lot more information from RNAseq data than you could from a microarray experiment. You can map transcriptional start sites, areas of unexpected transcription, splice sites, etc. - all because you have full sequence information that we have barely used in this example.
  • You can call variants from mapped RNAseq data, just be aware that many regions will have no coverage (because they are not expressed as RNA).

From here...

  • Visualize mapped reads in BAM files using IGV to manually check some of the gene counts.
  • Look at the more sophisticated "Tuxedo" suite of RNAseq tools, which performs many functions that are especially useful in Eukaryotic genomes.