Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

Overview

In this exercise, we will analyze RNA-seq data to measure changes in gene expression levels between wild-type and a mutant strains strain of the bacterium Listeria monocytogenes.

Learning Objectives

  • Review mapping reads with an example of how to use qsub to map many data sets in parallel on TACC.
  • Review samtools and SAM/BAM conversion.
  • How to use bedtools/HTseq to count reads overlapping genes.
  • Basic use of the R shell and installing BioConductor modules.
  • Use edgeR/DESeq to perform statistical analyses of differential gene expression.

Table of Contents

Table of Contents

Preliminary

Download data files

Copy the data files for this example into your $SCRATCH space:

...

This data was submitted to the Short Sequence Read Archive (SRA) to accompany this paper:

...

You can view the data in the ENA SRA here: http://www.ebi.ac.uk/ena/data/view/SRP001753

If you want to skip the read alignment step...

To get right to the new stuff, you can copy the mapped read BAM files and the reference sequence files that you will need using these commands:

...

Then, skip the mapping and SAM/BAM conversion, sorting, indexing steps below.

...

Install Bioconductor modules for R

Many of the modules for doing statistical tests on NGS data have been written in the "R" language for statistical computing. If you're not familiar with R, then this section is probably going to be a bit confusing. (You might be thinking "Stop with the new languages already guys! Uncle!") To orient you, we are going to run the R command, which launches the R shell inside our terminal. Like the bash shell that we normally use, the R shell interprets commands, but now they are R commands rather than bash commands. The prompt changes from login1$ to > when you are in the R shell, to help clue you in to this fact. The R shell is inside the bash shell. So when you quit R, you will be back where you were in the bash shell.

...

These commands will work for any Bioconductor module!

...

Create BAM file of mapped reads

Map reads using Bowtie

For RNA-seq analysis we're mainly counting the reads that align well, so we choose to use bowtie. (You could also use BWA or many other mappers.)

...

Expand
Just give me the answer...
Just give me the answer...
Code Block
module load bowtie
bowtie-build NC_017544.1.fasta NC_017544.1

Now create a commands file that looks like this:

Code Block
bowtie -p 3 -S NC_017544.1 SRR034450.fastq -S SRR034450.sam
bowtie -p 3 -S NC_017544.1 SRR034451.fastq -S SRR034451.sam
bowtie -p 3 -S NC_017544.1 SRR034452.fastq -S SRR034452.sam
bowtie -p 3 -S NC_017544.1 SRR034453.fastq -S SRR034453.sam

Create the launcher script and run it:

Code Block
module load python
launcher_creator.py -n bowtie -q development -c commands -t 0:30:00
qsub launcher.sge

Convert alignments to BAM

Edit your commands file so that you convert all of these files from SAM to sorted and indexed BAM.

...

Code Block
module load samtools
launcher_creator.py -n samtools -e you@somewhere.com
qsub launcher.sge

Optional Exercise

  • Is this a strand-specific RNA-seq library? Try using IGV to view some of the BAM file data and examine the reads mapped to each gene.

Count reads mapping to genes

...

bedtools

bedtools is a great utility for working with sequence features and mapped reads in BAM, BED, VCF, and GFF formats.

...

Code Block
bedtools multicov -s -bams SRR034450.bam SRR034451.bam SRR034452.bam SRR034453.bam -bed NC_017544.1.genes.gff > gene_counts.gff
head gene_counts.gff

...

Optional: HTseq

HTseq is another tool to count reads. bedtools has many many useful functions, and counting reads is just one of them. In contrast, HTseq is a specialized utility for counting reads, and it does not have many functions other than that. HTseq is very slow and you need to run multiple command lines in order to do the same job as what bedtools multicov did. Why do we learn this? Well, you may want to care about reads mapped on intersection when you count reads. Please take a look at this page, and if this sophisticated counting method looks useful for you, use HTseq. Otherwise, use bedtools.

Code Block
grep "^NC_017544" NC_017544.1.gff > count_ref.gff
samtools view SRR034450.bam | htseq-count -m intersection-nonempty -t gene -i ID - count_ref.gff > count1.gff
samtools view SRR034451.bam | htseq-count -m intersection-nonempty -t gene -i ID - count_ref.gff > count2.gff
samtools view SRR034452.bam | htseq-count -m intersection-nonempty -t gene -i ID - count_ref.gff > count3.gff
samtools view SRR034453.bam | htseq-count -m intersection-nonempty -t gene -i ID - count_ref.gff > count4.gff

Analyze differential gene expression

...

DESeq

...

...

You should copy the two *.pdf files that were created back to your local computer to view them.

Exercises

  • What are the numbers returned by sizeFactors( cds )?
    Expand
    Answer...
    Answer...

    They are, roughly speaking, the relative average coverage of each data set. Specifically, they are the size parameter of the negative binomial fit to the counts per gene per data file.

  • What are the dispersion estimates?
    Expand
    Answer...
    Answer...

    The model assumes there is also a per-gene aspect to the variance in counts observed, that is again fit to a negative binomial distribution (=overdispersed Poisson distribution). In this model, the lower the counts are, the more dispersion relative to the mean is expected (red line in graph). Thus, higher fold changes are required in lowly expressed genes to call the same observed fold-change difference as significant.

  • What was the predominant effect of the mutation on gene expression in this Listeria strain?

...

Optional: edgeR

edgeR is another R package that you can use to do a similar analysis.

...

  • In an actual RNAseq analysis, you might want to trim stray adaptor sequences from your data using a tool like the FASTX-Toolkit, FAR, or cutadapt before aligning.
  • You can get a lot more information from RNAseq data than you could from a microarray experiment. You can map transcriptional start sites, areas of unexpected transcription, splice sites, etc. - all because you have full sequence information that we have barely used in this example.

From here...

  • Visualize mapped reads in BAM files using IGV to manually check some of the gene counts.
  • Look at the more sophisticated "Tuxedo" suite of RNAseq tools, which performs many functions that are especially useful in Eukaryotic genomes.