Page Comparison

...

The newest version of R on lonestar is currently 2.14, but deep SNV deepSNV requires R version 2.15.

You can install your own version of R 2.15 on TACC using the instructions below, but this takes a while to compile, so you can also just add this a location to your path by adding this line to your ~/.profile_user file. We $PATH where we have installed R in this location2.15:

Code Block

title	Using the copy of R 2.15 that we have already installed

export PATH="/corral-repl/utexas/BioITeam/ngs_course/local/bin:$PATH"

...

A mixed population of E. coli from an evolution experiment was sequenced at several different time points (PMID: 19838166 , PMID:19776167). At generation 0 it consisted of a clone (cells grown from a colony with essentially no genetic variation), then additional samples were taken at after 20K and 40K generations after which mutations arose and swept through the (asexual) population(roughly 10 and 20 years) of culture in the laboratory. By these later times new mutations had arisen. Some had taken over (swept to fixation in) the population and were present in 100% of the individuals. Some mutations were only present in certain subpopulations and thus they were are intermediate frequencies (like 25%) in the sample.

Data

The data files for this example are in the path:

...

We renamed the FASTA sequence from "gi|254160123|ref|NC_012967.1|" to "NC_012967" by changing the first line of the NC_012967.1.fasta sequence using a text editor. It's just easier to deal with the shorter name.

Map Reads

Choose an appropriate program and map the reads. As for other variant callers, convert the mapped reads to BAM format, then sort and index the BAM file.

...

Code Block

title	Example command for running FreeBayes

login1$ freebayes --min-alternate-total 3 --ploidy 100 --pooled --vcf SRR032374.vcf \
        --fasta-reference NC_012967.1.fasta SRR032374.sorted.bam

Nice! We now have a familiar VCF file.

Additional exercises

Write a script or use a linux command to filter the output files to only contain variants that are predicted to have frequencies > 0.10 and < 0.90 or scores > 100.
What does the --min-alternate-total option mean. Experiment with using other options and examine how they affect speed versus what is predicted. Beware that some choices of options can cause crashes.
Compare the variants predicted in samples SRR032374 and SRR032376. You could view them in IGV.

Run deepSNV

deepSNV runs more slowly, so we will only look at a small region of the genome initially in interactive mode. (Why is it slower? Probably in part due to using a more sophisticated statistical model and in part because it is implemented in R instead of pure C.)

...

deepSNV website
deepSNV paper ... but UT library does not have access to this journal...
News article about deepSNV
deepSNV R module vignette

Code Block

title	Example deepSNV commands

login1$ R
...
> library("deepSNV")
> regions <- data.frame(chr="NC_012967", start = 1, stop=100000)
> result = deepSNV(test = "SRR032376.sorted.bam", control = "SRR030252.sorted.bam", regions=regions)
> sig_result <- summary(result, sig.level=0.05, adjust.method="BH")
> pdf("SRR032374.pdf")
> plot(mix)
> dev.off()
> write.csv(sig_result, "SRR032374")

The output is as a CSV (comma-separated) table, that can be loaded in R, Excel, or the like.

You have created a nice little plot SRR032374.pdf that you can copy back to your computer for viewing that compares allele frequencies between the samples.

Additional exercises

Create an R script to run from the command line to execute these commands, and try running the entire E. coli genome as a job on TACC. You might need to know that you can run an R script (script.r) from the command using one of these command lines:
Code Block
Rscript script.r R --vanilla < script.r
Repeat for SRR032374. Compare the variants predicted in samples SRR032374 and SRR032376.

Versions Compared

Old Version 21

New Version 22

Key

Data

Map Reads

Additional exercises

Run deepSNV

Additional exercises