Content Comparison

...

File Name	Description	Sample
`SRR034450.fastq`	Single-end Illumina 36-bp reads	wild-type, biological replicate 1
`SRR034451.fastq`	Single-end Illumina 36-bp reads	ΔsigB mutant, biological replicate 1
`SRR034452.fastq`	Single-end Illumina 36-bp reads	wild-type, biological replicate 2
`SRR034453.fastq`	Single-end Illumina 36-bp reads	ΔsigB mutant, biological replicate 2
`NC_017544.1.gbkfasta`	Reference Genome sequence (FASTA)	Listeria monocytogenes strain 10403S
`NC_017544.1.gff`	Reference Genome features (GFF)	Listeria monocytogenes strain 10403S

...

We've done this several times before, so you should be able to come up with the full command lines if you refer back to the original lesson. Be careful we are now mapping single-end reads, so you may have to look at the bowtie help to figure out how to do that!

You will need to first build the index file, just once and in "interactive mode" is fine. Then, you will need to submit a commands file with four lines to the TACC queue.

...

Expand

	Just give me the answer...
	Just give me the answer...

Code Block
module load bowtie bowtie-build NC_017544.1.fasta NC_017544.1

Now create a commands file that looks like this:

Code Block

---+++ Align reads to reference genome

---++++ Using bowtie2

First, index your genome so bowtie2 can map read to it: %BR%
<code>$bowtie2-build REL606.fna REL606</code> %BR%

Then, align each data set: %BR%
<code>$bowtie2 -x REL606 -U datasetX.fastq --phred33 -S REL606.sam</code> %BR%

Optionally, add the <code>--local</code> flag if your reads do not map end-to-end.

---++++ Using BWA
First, index your genome so BWA can map read to it: %BR%
<code>$bwa index REL606.fna</code> %BR%

Then, align each data set: %BR%
<code>$bwa aln REL606.fna dataset1.fastq > datasetX.sai </code> %BR%

And convert to SAM format (assumes single-end data):
<code>$bwa samse REL606.fna datasetX.sai datasetX.fastq > datasetX.sam </code> %BR%

---++ Count reads mapping to genes

<code>breseq RNASEQ -f REL606.fna -r REL606.gbk -o datasetX.count.tab datasetX.sam</code> %BR%

---++ Convert alignments to BAM

And convert to BAM format (assumes single-end data):

Code Block
login1$ samtools faidx REL606.fna login1$ samtools import REL606.fna datasetX.sam datasetX.unsorted.bam login1$ samtools sort datasetX.unsorted.bam datasetX login1$ samtools index datasetX.bam

...


bowtie --sam NC_017544.1 SRR034450.fastq SRR034450.sam
bowtie --sam NC_017544.1 SRR034451.fastq SRR034451.sam
bowtie --sam NC_017544.1 SRR034452.fastq SRR034452.sam
bowtie --sam NC_017544.1 SRR034453.fastq SRR034453.sam

Create the launcher script:

Code Block
module load python launcher_creator.py -n bowtie -e you@somewhere.com

Run it:

Code Block
qsub launcher.sge

Convert alignments to BAM

Edit your commands file so that you convert all of these files from SAM to sorted and indexed BAM.

Linux expert tip: you can string together commands all on one line, so that they are sent to the same core one after another by separating them on the line with ;.

Note the use of the variable $FILE, which means that is the only part of the line that we have to change. This is a mini-use of shell scripting.

Code Block


FILE=SRR034450 && samtools import NC_017544.1.fasta $FILE.sam $FILE.unsorted.bam && samtools sort $FILE.unsorted.bam $FILE && samtools index $FILE.bam
FILE=SRR034451 ; samtools import NC_017544.1.fasta $FILE.sam $FILE.unsorted.bam ; samtools sort $FILE.unsorted.bam $FILE ; samtools index $FILE.bam
FILE=SRR034452 ; samtools import NC_017544.1.fasta $FILE.sam $FILE.unsorted.bam ; samtools sort $FILE.unsorted.bam $FILE ; samtools index $FILE.bam
FILE=SRR034453 ; samtools import NC_017544.1.fasta $FILE.sam $FILE.unsorted.bam ; samtools sort $FILE.unsorted.bam $FILE ; samtools index $FILE.bam

Re-create the launcher script and submit this new job to the queue.:

Code Block
module load samtools launcher_creator.py -n samtools -e you@somewhere.com qsub launcher.sge

Optional Exercise

Is this a strand-specific RNA-seq library? Try using IGV to view some of the BAM file data and examine the reads mapped to each gene.

...

hg2. Analyze differential gene expression

Use bedtools to count reads in features.
Converting mapped reads to feature counts.

hg3. Using DESeq

[http://bioconductor.org/packages/devel/bioc/html/DESeq.html Manual and Instructions]

...

title	DESeq installation

...

Count reads mapping to genes using bedtools

bedtools is a great utility for working with sequence features and mapped reads in BAM, BED, VCF, and GFF formats.

We are going to use it to count the number of reads that map to each gene in the genome. Load the module and check out the help for bedtools and the multicov specific command that we are going to use:

Code Block
module load bedtools bedtools bedtools multicov

The multicov command takes a feature file (GFF) and counts how many reads in. By default it counts how many reads overlap the feature on either strand, but it can be made specific with an option.

Our GFF file has a lot of redundant features that describe a gene multiple times, so we are going to trim it just to have "gene" features using grep.

Code Block
grep '^NC_017544[[:space:]]GenBank[[:space:]]gene' NC_017544.1.gff > NC_017544.1.genes.gff

What is this doing? It's taking all the lines that begin with (^), then "NC_017544", then any number of spaces or tabs, then "GenBank", then any number of spaces or tabs, then "gene". Use head to see the before and after.

Code Block
head -n 50 NC_017544.1.gff head -n 50 NC_017544.1.genes.gff

In order to use the bedtools command on our data, submit this commands file to the TACC queue:

Code Block
bedtools multicov -bams SRR034450.bam SRR034451.bam SRR034452.bam SRR034453.bam -bed NC_017544.1.genes.gff > gene_counts.gff

hg2. Analyze differential gene expression (DESeq)

[http://bioconductor.org

...

/packages/devel/bioc/html/DESeq.html Manual and Instructions]

Code Block

title	Using DESeq

login1$ R
...
> library("DESeq")
> combined = read.csv("combined.csv", header=T, row.names=1)
> design <- data.frame(
  row.names = colnames( combined ),
  condition = c( "Anc", "Anc", "EL", "EL", "EW", "EW"),
  libType = c( "single-end", "single-end", "single-end",
  "single-end", "single-end", "single-end" ) )
> design
> conds <- factor(design$condition)
> cds <- newCountDataSet( combined, conds )
> cds <- estimateSizeFactors( cds )
> sizeFactors( cds )
> res <- nbinomTest( cds, "EL", "EW" )
> write.csv(res, "EL-vs-EW.csv")

...

Version	Old Version 15	New Version 16
Changes made by	Jeffrey E Barrick	Jeffrey E Barrick
Saved on	May 22, 2012	May 22, 2012

Versions Compared

Key

Convert alignments to BAM

Optional Exercise

Count reads mapping to genes using bedtools