Genome variation in mixed samples (FreeBayes, deepSNV)
In this tutorial we will use two different programs to identify variants in mixed genomic samples where DNA from many individuals was pooled and sequenced together.
- FreeBayes can be used to call variants in genomes of any ploidy, pooled samples, or mixed populations. It can be used on single samples.
- deepSNV can be used to call single-nucleotide variants and single-base deletions in ultra-deep sequencing data sets. It compares variation between two samples.
Install FreeBayes
This tutorial assumes that you have created the paths $WORK/src
and $HOME/local/bin
and added $HOME/local/bin
to your $PATH
. FreeBayes uses a git repository and requires the cmake build system to compile. You can install it with these commands:
login1$ module load git login1$ mkdir -p $WORK/src && cd $WORK/src login1$ git clone --recursive git://github.com/ekg/freebayes.git login1$ cd freebayes login1$ module load cmake login1$ make login1$ mv bin/* $HOME/local/bin
This command from the FreeBayes instructions attempts to install to a system-wide location as super-user:
sudo make install
This won't work on Lonestar! (You aren't an admin.) However, the make
command already created the executables inside of the bin
directory in the source tree, so we can find and move them to our standard $HOME/local/bin
directory with the last command.
Install deepSNV
The newest version of R on lonestar is currently 2.14, but deepSNV requires R version 2.15.
You can install your own version of R 2.15 on TACC using the instructions below, but this takes a while to compile, so you can also just add a location to your $PATH where we have installed R 2.15:
export PATH="/corral-repl/utexas/BioITeam/ngs_course/local/bin:$PATH"
If you want to go through installing R 2.15 and deepSNV for yourself, here's how:
login1$ wget http://cran.wustl.edu/src/base/R-2/R-2.15.0.tar.gz login1$ tar -xvzf R-2.15.0.tar.gz login1$ cd R-2.15.0 login1$ ./configure --prefix=$HOME/local login1$ make login1$ make install
Once you have access to R 2.15, you can install deepSNV using these commands (which work for any BioConductor package).
login1$ R ... > source("http://bioconductor.org/biocLite.R") > biocLite("deepSNV")
Example 1: Mixed E. coli population
A mixed population of E. coli from an evolution experiment was sequenced at several different time points (PMID: 19838166 , PMID:19776167). At generation 0 it consisted of a clone (cells grown from a colony with essentially no genetic variation), then additional samples were taken after 20K and 40K generations (roughly 10 and 20 years) of culture in the laboratory. By these later times new mutations had arisen. Some had taken over (swept to fixation in) the population and were present in 100% of the individuals. Some mutations were only present in certain subpopulations and thus they were are intermediate frequencies (like 25%) in the sample.
Data
The data files for this example are in the path:
/corral-repl/utexas/BioITeam/ngs_course/ecoli_mixed
File Name |
Description |
---|---|
|
Illumina reads, 0K generation individual clone from population |
|
Illumina reads, 20K generation mixed population |
|
Illumina reads, 40K generation mixed population |
|
E. coli B str. REL606 genome |
The read files were downloaded from the ENA SRA study.
So that you can treat all the data as single-ended for simplicity, we concatenated two separate FASTQ (paired-end) files for sample SRR030252 using this command
cat SRR030252_1.fastq SRR030252_2.fastq > SRR030252.fastq
Alternatively, you could map that data set as paired-end.
The reference genome file was downloaded from the NCBI Genomes page.
We renamed the FASTA sequence from "gi|254160123|ref|NC_012967.1|" to "NC_012967" by changing the first line of the NC_012967.1.fasta sequence using a text editor. It's just easier to deal with the shorter name.
Map Reads
Choose an appropriate program and map the reads. As for other variant callers, convert the mapped reads to BAM format, then sort and index the BAM file.
Additional exercises
- Determine the approximate depth of mapped read coverage for each sequencing data set.
- Try using different mappers or changing the default alignment settings to find more variants.
Run FreeBayes
FreeBayes can be used to treat the sample as a mixture of pooled samples. (In our case it is actually a mixture of >1 million bacteria, but we have nowhere near that level of coverage, so we give an arbitrary mixed ploidy of 100, which means we use a statistical model that predicts variants only with frequencies of 1%, 2%, 3%, ... 98%, 99%, 100%). This command runs pretty fast, so you can do it in interactive mode.
login1$ freebayes --min-alternate-total 3 --ploidy 100 --pooled --vcf SRR032374.vcf \ --fasta-reference NC_012967.1.fasta SRR032374.sorted.bam
Nice! We now have a familiar VCF file.
Additional exercises
- Write a script or use a linux command to filter the output files to only contain variants that are predicted to have frequencies > 0.10 and < 0.90 or scores > 100.
- What does the
--min-alternate-total
option mean. Experiment with using other options and examine how they affect speed versus what is predicted. Beware that some choices of options can cause crashes. - Compare the variants predicted in samples SRR032374 and SRR032376. You could view them in IGV.
Run deepSNV
deepSNV runs more slowly, so we will only look at a small region of the genome initially in interactive mode. (Why is it slower? Probably in part due to using a more sophisticated statistical model and in part because it is implemented in R instead of pure C.)
Useful Links
- deepSNV website
- deepSNV paper ... but UT library does not have access to this journal...
- News article about deepSNV
- deepSNV R module vignette
login1$ R ... > library("deepSNV") > regions <- data.frame(chr="NC_012967", start = 1, stop=100000) > result = deepSNV(test = "SRR032376.sorted.bam", control = "SRR030252.sorted.bam", regions=regions) > sig_result <- summary(result, sig.level=0.05, adjust.method="BH") > pdf("SRR032374.pdf") > plot(mix) > dev.off() > write.csv(sig_result, "SRR032374")
The output is as a CSV (comma-separated) table, that can be loaded in R, Excel, or the like.
You have created a nice little plot SRR032374.pdf
that you can copy back to your computer for viewing that compares allele frequencies between the samples.
Additional exercises
- Create an R script to run from the command line to execute these commands, and try running the entire E. coli genome as a job on TACC. You might need to know that you can run an R script (
script.r
) from the command using one of these command lines:Rscript script.r R --vanilla < script.r
- Repeat for SRR032374. Compare the variants predicted in samples SRR032374 and SRR032376.
Welcome to the University Wiki Service! Please use your IID (yourEID@eid.utexas.edu) when prompted for your email address during login or click here to enter your EID. If you are experiencing any issues loading content on pages, please try these steps to clear your browser cache.