...
File Name | Description | Sample |
|---|---|---|
| Single-end Illumina 36-bp reads | wild-type, biological replicate 1 |
| Single-end Illumina 36-bp reads | ΔsigB mutant, biological replicate 1 |
| Single-end Illumina 36-bp reads | wild-type, biological replicate 2 |
| Single-end Illumina 36-bp reads | ΔsigB mutant, biological replicate 2 |
| Reference Genome sequence (FASTA) | Listeria monocytogenes strain 10403S |
| Reference Genome features (GFF) | Listeria monocytogenes strain 10403S |
...
We've done this several times before, so you should be able to come up with the full command lines if you refer back to the original lesson. Be careful we are now mapping single-end reads, so you may have to look at the bowtie help to figure out how to do that!
You will need to first build the index file, just once and in "interactive mode" is fine. Then, you will need to submit a commands file with four lines to the TACC queue.
...
| Expand | ||||
|---|---|---|---|---|
| ||||
Now create a commands file that looks like this:
|
---+++ Align reads to reference genome
---++++ Using bowtie2
First, index your genome so bowtie2 can map read to it: %BR%
<code>$bowtie2-build REL606.fna REL606</code> %BR%
Then, align each data set: %BR%
<code>$bowtie2 -x REL606 -U datasetX.fastq --phred33 -S REL606.sam</code> %BR%
Optionally, add the <code>--local</code> flag if your reads do not map end-to-end.
---++++ Using BWA
First, index your genome so BWA can map read to it: %BR%
<code>$bwa index REL606.fna</code> %BR%
Then, align each data set: %BR%
<code>$bwa aln REL606.fna dataset1.fastq > datasetX.sai </code> %BR%
And convert to SAM format (assumes single-end data):
<code>$bwa samse REL606.fna datasetX.sai datasetX.fastq > datasetX.sam </code> %BR%
---++ Count reads mapping to genes
<code>breseq RNASEQ -f REL606.fna -r REL606.gbk -o datasetX.count.tab datasetX.sam</code> %BR%
---++ Convert alignments to BAM
And convert to BAM format (assumes single-end data):
| Code Block |
|---|
login1$ samtools faidx REL606.fna
login1$ samtools import REL606.fna datasetX.sam datasetX.unsorted.bam
login1$ samtools sort datasetX.unsorted.bam datasetX
login1$ samtools index datasetX.bam
|
...
Create the launcher script:
Run it:
|
Convert alignments to BAM
Edit your commands file so that you convert all of these files from SAM to sorted and indexed BAM.
Linux expert tip: you can string together commands all on one line, so that they are sent to the same core one after another by separating them on the line with ;.
Note the use of the variable $FILE, which means that is the only part of the line that we have to change. This is a mini-use of shell scripting.
| Code Block |
|---|
FILE=SRR034450 && samtools import NC_017544.1.fasta $FILE.sam $FILE.unsorted.bam && samtools sort $FILE.unsorted.bam $FILE && samtools index $FILE.bam
FILE=SRR034451 ; samtools import NC_017544.1.fasta $FILE.sam $FILE.unsorted.bam ; samtools sort $FILE.unsorted.bam $FILE ; samtools index $FILE.bam
FILE=SRR034452 ; samtools import NC_017544.1.fasta $FILE.sam $FILE.unsorted.bam ; samtools sort $FILE.unsorted.bam $FILE ; samtools index $FILE.bam
FILE=SRR034453 ; samtools import NC_017544.1.fasta $FILE.sam $FILE.unsorted.bam ; samtools sort $FILE.unsorted.bam $FILE ; samtools index $FILE.bam
|
Re-create the launcher script and submit this new job to the queue.:
| Code Block |
|---|
module load samtools
launcher_creator.py -n samtools -e you@somewhere.com
qsub launcher.sge
|
Optional Exercise
- Is this a strand-specific RNA-seq library? Try using IGV to view some of the BAM file data and examine the reads mapped to each gene.
...
hg2. Analyze differential gene expression
- Use bedtools to count reads in features.
- Converting mapped reads to feature counts.
hg3. Using DESeq
...
| title | DESeq installation |
|---|
...
Count reads mapping to genes using bedtools
bedtools is a great utility for working with sequence features and mapped reads in BAM, BED, VCF, and GFF formats.
We are going to use it to count the number of reads that map to each gene in the genome. Load the module and check out the help for bedtools and the multicov specific command that we are going to use:
| Code Block |
|---|
module load bedtools
bedtools
bedtools multicov
|
The multicov command takes a feature file (GFF) and counts how many reads in. By default it counts how many reads overlap the feature on either strand, but it can be made specific with an option.
Our GFF file has a lot of redundant features that describe a gene multiple times, so we are going to trim it just to have "gene" features using grep.
| Code Block |
|---|
grep '^NC_017544[[:space:]]*GenBank[[:space:]]*gene' NC_017544.1.gff > NC_017544.1.genes.gff
|
What is this doing? It's taking all the lines that begin with (^), then "NC_017544", then any number of spaces or tabs, then "GenBank", then any number of spaces or tabs, then "gene". Use head to see the before and after.
| Code Block |
|---|
head -n 50 NC_017544.1.gff
head -n 50 NC_017544.1.genes.gff
|
In order to use the bedtools command on our data, submit this commands file to the TACC queue:
| Code Block |
|---|
bedtools multicov -bams SRR034450.bam SRR034451.bam SRR034452.bam SRR034453.bam -bed NC_017544.1.genes.gff > gene_counts.gff
|
hg2. Analyze differential gene expression (DESeq)
...
| Code Block | ||
|---|---|---|
| ||
login1$ R
...
> library("DESeq")
> combined = read.csv("combined.csv", header=T, row.names=1)
> design <- data.frame(
row.names = colnames( combined ),
condition = c( "Anc", "Anc", "EL", "EL", "EW", "EW"),
libType = c( "single-end", "single-end", "single-end",
"single-end", "single-end", "single-end" ) )
> design
> conds <- factor(design$condition)
> cds <- newCountDataSet( combined, conds )
> cds <- estimateSizeFactors( cds )
> sizeFactors( cds )
> res <- nbinomTest( cds, "EL", "EW" )
> write.csv(res, "EL-vs-EW.csv")
|
...