Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

  • Don't forget: it's q() to quit.
  • For help, type ?command. Try ?read.table. The q key gets you out of help, just like for a man page.
  • The left arrow \<- (less-than-dash) is the same as an equals sign \=. You can use them interchangeably.
  • The prompt we will sometimes be showing for R is >. Don't type this for a command. It is like the login1$ at the beginning of the bash prompt when you log in to Lonestar. It just means that you are in the R shell.
  • You can type the name of a variable to have its value displayed. Like this...
    Code Block
    
    > x <- 10 + 5 + 6
    > x
    

...

  • [1] 21
    

Like other languages, R can be expanded by loading modules. The R equivalent of Bioperl or Biopython is Bioconductor. Bioconductor can do things for you like convert sequences, but where it really shines is in doing statistical tests (where is it second-to-none in this list of languages). Many functions for analyzing microarray data are implemented in R, and this strength has now carried over to the analysis of RNAseq data.

...

Expand
Just give me the answer...
Just give me the answer...
Code Block
module load bowtiebowtie2
bowtiebowtie2-build NC_017544.1.fasta NC_017544.1

Now create a commands file that looks like this:

Code Block
bowtiebowtie2 -p 3 -samx NC_017544.1 -U SRR034450.fastq -S SRR034450.sam
bowtiebowtie2 -p 3 -samx NC_017544.1 -U SRR034451.fastq -S SRR034451.sam
bowtiebowtie2 -p 3 -samx NC_017544.1 -U SRR034452.fastq -S SRR034452.sam
bowtiebowtie2 -p 3 -samx NC_017544.1 -U SRR034453.fastq -S SRR034453.sam

Create the launcher script and run it:

Code Block
module load python
launcher_creator.py -n bowtie -e you@somewhere.com 

Run it:

Code Block
-t 0:30:00
qsub launcher.sge

h2. Convert alignments to BAM

...



Edit your {{commands}} file so that you convert all of these files from SAM to sorted and indexed BAM.

...



*Linux expert tip:* you can string together commands all on one line, so that they are sent to the same core one after another by separating them on the line

...

Note the use of the variable $FILE, which means that is the only part of the line that we have to change. This is a mini-use of shell scripting.

Code Block
FILE=SRR034450 && samtools import
 with {{&&}}.

Note the use of the variable $FILE, which means that is the only part of the line that we have to change. This is a mini-use of shell scripting.

FILE=SRR034450 && samtools import NC_017544.1.fasta

$FILE.sam

$FILE.unsorted.bam

&&

samtools

sort

$FILE.unsorted.bam

$FILE

&&

samtools

index

$FILE.bam


FILE=SRR034451

&&

samtools

import

NC_017544.1.fasta

$FILE.sam

$FILE.unsorted.bam

&&

samtools

sort

$FILE.unsorted.bam

$FILE

&&

samtools

index

$FILE.bam


FILE=SRR034452

&&

samtools

import

NC_017544.1.fasta

$FILE.sam

$FILE.unsorted.bam

&&

samtools

sort

$FILE.unsorted.bam

$FILE

&&

samtools

index

$FILE.bam


FILE=SRR034453

&&

samtools

import

NC_017544.1.fasta

$FILE.sam

$FILE.unsorted.bam

&&

samtools

sort

$FILE.unsorted.bam

$FILE

&&

samtools

index

$FILE.bam

Code Block


Re-create the launcher script and submit this new job to the queue.

...

Code Block
module load samtools
 Be sure you have samtools loaded as the node that your job launches on will inherit your current environment, including whatever modules you have loaded:

module load samtools
launcher_creator.py

-n

samtools

-e

you@somewhere.com


qsub

launcher.sge

Optional Exercise

...

Code Block


h4. Optional Exercise

* Is this a strand-specific RNA-seq library? Try using IGV to view some of the BAM file data and examine the reads mapped to each gene.

h2.

...

 Count reads mapping to genes using

...

bedtools is a great utility for working with sequence features and mapped reads in BAM, BED, VCF, and GFF formats.

We are going to use it to count the number of reads that map to each gene in the genome. Load the module and check out the help for bedtools and the multicov specific command that we are going to use:

Code Block

module load bedtools
bedtools
bedtools multicov

The multicov command takes a feature file (GFF) and counts how many reads in. By default it counts how many reads overlap the feature on either strand, but it can be made specific with an option.

Our GFF file has a lot of redundant features that describe a gene multiple times, so we are going to trim it just to have "gene" features using grep.

Code Block
grep '^NC_017544[[:space:]]*GenBank[[:space:]]*gene'
 bedtools

[bedtools|http://code.google.com/p/bedtools/] is a great utility for working with sequence features and mapped reads in BAM, BED, VCF, and GFF formats.

We are going to use it to count the number of reads that map to each gene in the genome. Load the module and check out the help for bedtools and the multicov specific command that we are going to use:

module load bedtools
bedtools
bedtools multicov

Code Block


The multicov command takes a feature file (GFF) and counts how many reads are in certain regions from many input files. By default it counts how many reads overlap the feature on either strand, but it can be made specific with an option.

Our GFF file has a lot of redundant features that describe a gene multiple times, so we are going to trim it just to have "gene" features using {{grep}}.

grep '^NC_017544[space:]*GenBank[space:]*gene' NC_017544.1.gff

>

NC_017544.1.genes.gff

Code Block


What is this doing? It's taking all the lines that begin with (^), then "NC_017544", then any number of spaces or tabs, then "GenBank", then any number of spaces or tabs, then "gene". Use {{head}} to see the before and after.

Code Block

head

-n

50

NC_017544.1.gff


head

-n

50

NC_017544.1.genes.gff

Code Block


In order to use the bedtools command on our data, submit this commands file to the TACC queue:
Code Block

bedtools

multicov

-bams

SRR034450.bam

SRR034451.bam

SRR034452.bam

SRR034453.bam

-bed

NC_017544.1.genes.gff

>

gene_counts.gff


head

gene_counts.gff

Code Block


h2. 
Analyze differential gene expression (DESeq)

...

Our data that is cluttered with a lot of extra columns and one column stuffed with tag\=value information (including the gene names that we want!). Let's clean it up a bit before loading into R - which likes to work on simple tables. GFF are tab-delimited files.

We can do this cleanup many ways, but a quick one is to use the Unix string editor sed. This command replaces the entire beginning of the line up to locus_tag= with nothing (that is, it deletes it). This conveniently leaves us with just the locus_tag and the columns of read counts in each gene. If you were writing a real pipeline, you would probably want to use a Perl script that would check to be sure that each line had the locus_tag (they do), among other things.

Code Block
titleReformatting gene_counts.gff


   * [DESeq Manual and Instructions|http://bioconductor.org/packages/devel/bioc/html/DESeq.html]
 
Our data that is cluttered with a lot of extra columns and one column stuffed with {{tag\=value}} information (including the gene names that we want!). Let's clean it up a bit before loading into R - which likes to work on simple tables. GFF are tab-delimited files. 

We can do this cleanup many ways, but a quick one is to use the Unix string editor {{sed}}. This command replaces the entire beginning of the line up to {{locus_tag=}} with nothing (that is, it deletes it). This conveniently leaves us with just the locus_tag and the columns of read counts in each gene. If you were writing a real pipeline, you would probably want to use a Perl or Python script that would check to be sure that each line had the locus_tag (they do), among other things.

{code:title=Reformatting gene_counts.gff}
head gene_counts.gff
sed 's/^.*locus_tag=//' gene_counts.gff >
gene_counts.tab

After it has run, take a peek at the new file:

Code Block
head
 gene_counts
.tab

...

.tab

After it has run, take a peek at the new file:

Code Block

head gene_counts.tab
Warning

Do not copy the > characters in the

...

examples. They

...

are the R prompt to remind you which commands you are running the inside of R!*

The commands for this example are also described in the DESeq vignette (PDF) .

Code Block
titleUsing DESeq

login1$ R
...
> library("DESeq")
> counts = read.delim("gene_counts.tab", header=F, row.names=1)
> head(counts)
> colnames(counts) = c("wt1", "mut1", "wt2", "mut2")
> head(counts)
> my.design <- data.frame(
  row.names = colnames( counts ),
  condition = c( "wt", "mut", "wt", "mut"),
  libType = c( "single-end", "single-end", "single-end", "single-end" ) 
)
> conds <- factor(my.design$condition)

> cds <- newCountDataSet( counts, conds )
> cds

> cds <- estimateSizeFactors( cds )
> sizeFactors( cds )

> cds <- estimateDispersions( cds )

> pdf("DESeq-dispersion_estimates.pdf")
> plot(
  rowMeans( counts( cds, normalized=TRUE ) ),
  fitInfo(cds)$perGeneDispEsts,
  pch = '.', log="xy" )
  xg <- 10^seq( -.5, 5, length.out=300 )
  lines( xg, fitInfo(cds)$dispFun( xg ), col="red" )
)
> dev.off()

> result <- nbinomTest( cds, "wt", "mut" )
> head(result)

> result = result[order(result$pval), ]
> head(result)

> write.csv(result, "DESeq-wt-vs-mut.csv")

> pdf("DESeq-MA-plot.pdf")
> plot(
  result$baseMean,
  result$log2FoldChange,
  log="x", pch=20, cex=.3,
  col = ifelse( result$padj < .1, "red", "black" ) )
> dev.off()

> q()
Save workspace image? [y/n/c]: n
login1$ head wt-vs-mut.csv

wt-vs-mut.csv is a comma-delimited file that could be reloaded into R or viewed in Excel.

You should copy the two *.pdf files that were created back to your Desktop to view them.

Exercises

  • What are the numbers returned by sizeFactors( cds )?
    Expand
    Answer...
    Answer...

    They are, roughly speaking, the relative average coverage of each data set. Specifically, they are the size parameter of the negative binomial fit to the counts per gene per data file.
  • What are the dispersion estimates?
    Expand
    Answer...
    Answer...

    The model assumes there is also a per-gene aspect to the variance in counts observed, that is again fit to a negative binomial distribution (=overdispersed Poisson distribution). In this model, the lower the counts are, the more dispersion relative to the mean is expected (red line in graph). Thus, higher fold changes are required in lowly expressed genes to call the same observed fold-change difference as significant.

  • What was the predominant effect of the mutation on gene expression in this Listeria strain?

...