...
You will need to first build the index file, just once and in "interactive mode" is fine (it's fast, so you don't need an idev shell). Then, you will need to submit a commands file with four lines to the TACC queue using qsub.
Please give the final output files the names: SRR034450.sam, SRR034451.sam, SRR034452.sam, SRR034453.sam.
| Expand |
|---|
| I just want a little hint |
|---|
| I just want a little hint |
|---|
|
Remember, bowtie-build once then bowtie for each separate sample. |
| Expand |
|---|
| Just give me the answerPlease take me through all of the steps...Just give me the answer |
|---|
| Please take me through all of the steps... |
|---|
|
| Code Block |
|---|
module load bowtie
bowtie-build NC_017544.1.fasta NC_017544.1
|
Now create a bowtie_commands file that looks like this using nano or another text editor: | Code Block |
|---|
bowtie -p 3 -S NC_017544.1 SRR034450.fastq -S SRR034450.sam
bowtie -p 3 -S NC_017544.1 SRR034451.fastq -S SRR034451.sam
bowtie -p 3 -S NC_017544.1 SRR034452.fastq -S SRR034452.sam
bowtie -p 3 -S NC_017544.1 SRR034453.fastq -S SRR034453.sam
|
Create the launcher script and run it: | Warning |
|---|
Remember that you cannot qsub from within an idev shell! |
| Code Block |
|---|
module load python
launcher_creator.py -n bowtie -q development -c bowtie_commands -t 0:30:00
qsub launcherbowtie.sge
|
|
Convert Bowtie output to BAM
Edit your Create a new samtools_commands file so that you convert all of these files from SAM to sorted and indexed BAM all at one time by using qsub.
Linux expert tip: you can string together commands all on one line, so that they are sent to the same core one after another by separating them on the line with &&.
Note the use of the variable $FILE, which means that is the only part of the line that we have to changeby which we set a variable's value int he first part of the line and use it over and over in the latter part of the line. This is a mini - use of shell scripting.
| Code Block |
|---|
| title | Contents of samtools_commands file |
|---|
|
FILE=SRR034450 && samtools import NC_017544.1.fasta $FILE.sam $FILE.unsorted.bam && samtools sort $FILE.unsorted.bam $FILE && samtools index $FILE.bam
FILE=SRR034451 && samtools import NC_017544.1.fasta $FILE.sam $FILE.unsorted.bam && samtools sort $FILE.unsorted.bam $FILE && samtools index $FILE.bam
FILE=SRR034452 && samtools import NC_017544.1.fasta $FILE.sam $FILE.unsorted.bam && samtools sort $FILE.unsorted.bam $FILE && samtools index $FILE.bam
FILE=SRR034453 && samtools import NC_017544.1.fasta $FILE.sam $FILE.unsorted.bam && samtools sort $FILE.unsorted.bam $FILE && samtools index $FILE.bam
|
Re-create the Create a new launcher script and submit this new job to the queue. Be sure you have samtools loaded as the node that your job launches on will inherit your current environment, including whatever modules you have loaded:
| Code Block |
|---|
module load samtools
|
| Expand |
|---|
| I'd like to see the commands for this qsub... |
|---|
| I'd like to see the commands for this qsub... |
|---|
|
| Code Block |
|---|
launcher_creator.py -n samtools -q development -j samtools_commands - | e you@somewhere.comlauncher |
Optional Exercise
- Is this a strand-specific RNA-seq library? Try using IGV to view some of the BAM file data and examine the reads mapped to each gene.
...
The multicov command takes a feature file (GFF) and counts how many reads are in certain regions from many input files. By default it counts how many reads overlap the feature on either strand, but it can be made specific with the -s option. option. Unfortunately, this option only exists for the multicov command in a version of bedtools that is newer than the module on TACC, so we don't include it in the example command below.
Note: Remember that the chromosome names in your gff file should match the way the chromosomes are named in the reference fasta file used in the mapping step. For example, if gff BAM file used for mapping contains chr1, chrX etc, the GFF file must also call the chromosomes as chr1, chrX and so on.
Our GFF file has a lot of redundant features that describe a gene multiple times, so we are going to trim it just to have "gene" features using grep.
| Code Block |
|---|
| title | This is all one line... |
|---|
|
grep '^NC_017544[[:space:]]*GenBank[[:space:]]*gene' NC_017544.1.gff > NC_017544.1.genes.gff
|
...
In order to use the bedtools command on our data, submit this commands file do the following:
| Warning |
|---|
| Submit to the TACC queue or run in an idev shell |
|---|
| Submit to the TACC queue |
|---|
|
...
|
| Code Block |
|---|
bedtools multicov - | s -bams SRR034450.bam SRR034451.bam SRR034452.bam SRR034453.bam -bed NC_017544.1.genes.gff > gene_counts.gff
|
|
Then take a peek at the data...
| Code Block |
|---|
head gene_counts.gff
|
| Warning |
|---|
Optional: HTseqHTseq is another tool to count reads. bedtools has many many useful functions, and counting reads is just one of them. In contrast, HTseq is a specialized utility for counting reads, and it does not have many functions other than that. HTseq is very slow and you need to run multiple command lines in order to do the same job as what bedtools multicov did. Why do we learn this? Well, you may want to care about reads mapped on intersection when you count reads. Please take a look at this page, and if this sophisticated counting method looks useful for you, use HTseq. Otherwise, use bedtools. | Code Block |
|---|
grep "^NC_017544" NC_017544.1.gff > count_ref.gff
samtools view SRR034450.bam | htseq-count -m intersection-nonempty -t gene -i ID - count_ref.gff > count1.gff
samtools view SRR034451.bam | htseq-count -m intersection-nonempty -t gene -i ID - count_ref.gff > count2.gff
samtools view SRR034452.bam | htseq-count -m intersection-nonempty -t gene -i ID - count_ref.gff > count3.gff
samtools view SRR034453.bam | htseq-count -m intersection-nonempty -t gene -i ID - count_ref.gff > count4.gff
|
Analyze differential gene expression DESeqOur data that is cluttered with a lot of extra columns and one column stuffed with tag=value information (including the gene names that we want!). Let's clean it up a bit before loading into R - which likes to work on simple tables. GFF are tab-delimited files. We can do this cleanup many ways, but a quick one is to use the Unix string editor sed. This command replaces the entire beginning of the line up to locus_tag= with nothing (that is, it deletes it). This conveniently leaves us with just the locus_tag and the columns of read counts in each gene. If you were writing a real pipeline, you would probably want to use a Perl or Python script that would check to be sure that each line had the locus_tag (they do), among other things. | Code Block |
|---|
| title | Reformatting gene_counts.gff |
|---|
|
head gene_counts.gff
sed 's/^.*locus_tag=//' gene_counts.gff > gene_counts.tab
|
After it has run, take a peek at the new file: | Code Block |
|---|
head gene_counts.tab
|
| Warning |
|---|
| title | Be very careful how you copy and paste from the example below. |
|---|
| |
Do not copy the > characters. Some commands are spread across multiple lines. The > are missing at the beginning of the lines after the first one in these cases. So this: | Code Block |
|---|
> y <- c(
1:10
)
> y
|
Is the same as: | Code Block |
|---|
> y <- c(1:10)
> y
|
It's ok to copy across the multiple lines and paste into R as long as you get all the way to the closing parenthesis. |
...
You should copy the two *.pdf files that were created back to your local computer to view them .using scp
Exercises
- What are the numbers returned by
sizeFactors( cds )? | Expand |
|---|
|
They are, roughly speaking, the relative average coverage of each data set. Specifically, they are the size parameter of the negative binomial fit to the counts per gene per data file. |
- What are the dispersion estimates?
| Expand |
|---|
|
The model assumes there is also a per-gene aspect to the variance in counts observed, that is again fit to a negative binomial distribution (=overdispersed Poisson distribution). In this model, the lower the counts are, the more dispersion relative to the mean is expected (red line in graph). Thus, higher fold changes are required in lowly expressed genes to call the same observed fold-change difference as significant. |
- What was the predominant effect of the mutation on gene expression in this Listeria strain?
...