Identification of Variants in Mixed Population Sequencing (GVA14)
Overview:
Identification of variants in mixed population sequencing data uses the same principles as their identification in clonal (homogeneous) sources of DNA. The difference is that, rather than just listing which variants are present, the number of reads supporting each variant must be counted and compared to the reads supporting all other variants at the same location to determine that variant's frequency. Under clonal sequencing conditions, sequencing errors can safely be ignored. In mixed population sequencing, by contrast, the sequencing error rate sets the lower limit of detection, and therefore the achievable accuracy of the experiment.
Here we will demonstrate effective use of breseq to identify variants in mixed population data and gain insight into some of the error correction breseq provides.
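The counting described in the overview can be sketched in a few lines of shell. This is only an illustration of the arithmetic, not what breseq does internally: the function name site_frequencies and the read counts are made up for the example.

```shell
# site_frequencies: read "base<TAB>read_count" lines for ONE reference position
# and print each base with its count and its fraction of the total coverage.
site_frequencies() {
    awk -F'\t' '
        { base[NR] = $1; count[NR] = $2; total += $2 }   # tally coverage at the site
        END {
            for (i = 1; i <= NR; i++)
                printf "%s\t%d\t%.3f\n", base[i], count[i], count[i] / total
        }'
}

# Hypothetical counts at a single position (not real data).
printf 'A\t940\nG\t58\nT\t2\n' | site_frequencies
```

A base supported by only a handful of reads ends up at a frequency that counts alone cannot separate from sequencing error, which is why the additional tests discussed in this tutorial matter.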
Learning Objectives:
- Identify variants in mixed population sequencing data.
- Understand the sources of false positive and false negative variants.
- Leverage knowledge of the sources of false positives to eliminate these types of errors.
Tutorial:
The optional tutorial from day 2 (Advanced variant calling tutorial (GVA14)) detailed the use of the breseq pipeline to call variants on clonal samples from an evolving E. coli population. While this tutorial does not require completion of the previous tutorial, many of the finer points of breseq are better covered there. This tutorial will focus on the use of the polymorphism mode of breseq to identify variants from a mixed population.
All fastq files necessary for this tutorial can be found inside the $BI/gva_course/mixed_population folder. Copy all fastq files to a new folder named fastq, and REL606.6.gbk to a new folder named reference.
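One possible way to set up that layout is sketched below. The helper name stage_inputs is hypothetical, and the *.fastq pattern assumes the reads are uncompressed files ending in .fastq; adjust the pattern if the folder contents differ.

```shell
# stage_inputs SRC [DEST]: copy the reads and the reference genome from SRC
# into DEST/fastq and DEST/reference (DEST defaults to the current directory).
stage_inputs() {
    src=$1
    dest=${2:-.}
    mkdir -p "$dest/fastq" "$dest/reference"
    cp "$src"/*.fastq "$dest/fastq/"             # assumes read files end in .fastq
    cp "$src/REL606.6.gbk" "$dest/reference/"
}
```

From the directory where you want to work, it would be called as: stage_inputs "$BI/gva_course/mixed_population"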
module load bowtie/2.1.0
The module load command above must be run before breseq will work.
In polymorphism mode, which is enabled by adding the -p flag to any breseq command, breseq performs several statistical tests by default to rule out false positives. To highlight what breseq is normally doing, we will run the same fastq files with and without several of these tests. Specifically, the filters on base quality scores, polymorphism scores, polymorphism bias, and minimum strand coverage will be turned off in one of the runs. All 4 of the corresponding options can be found in the breseq -h output, and each value should be set to 0. While running breseq in polymorphism mode is fairly simple, the command that turns off all of these options is long, so it is recommended that you copy and paste the commands into a commands file or an idev session.
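A sketch of how the two runs might be written to a commands file follows. The four long option names are assumptions based on the filter names above, and their spelling varies between breseq versions, so confirm each one against the breseq -h output before running. The output folder names filtered and unfiltered are placeholders chosen for this example.

```shell
# Options that switch off the four filters by setting each to 0.
# ASSUMPTION: these spellings must be checked against `breseq -h`.
NO_FILTERS="--base-quality-cutoff 0 --polymorphism-score-cutoff 0 --polymorphism-bias-cutoff 0 --polymorphism-minimum-coverage-each-strand 0"

# Write one breseq command per line. Inside the here-document $NO_FILTERS is
# expanded, but the fastq/*.fastq glob stays literal until the line is executed.
cat > commands <<EOF
breseq -p -r reference/REL606.6.gbk -o filtered fastq/*.fastq
breseq -p -r reference/REL606.6.gbk -o unfiltered $NO_FILTERS fastq/*.fastq
EOF

cat commands   # review the two commands before running them
```

Keeping both commands in one file makes it easy to see that the only difference between the runs is the set of filter options.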
Each of these commands will take ~25-30 minutes to finish running. If running on an idev node, a single ampersand "&" can be added to the end of the line so the command will run in the background, giving you your prompt back to start the other command. If you have already started the command before reading this, a useful trick on Linux systems for moving a running process to the background is the following:
Ctrl + z (suspends the running command)
jobs (lists your jobs and their job numbers)
%jobnumber & (resumes that job in the background)
In your case, you should only have 1 job running, so you would type "%1 &" for the last line.
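The trailing ampersand can be tried on a harmless stand-in before committing to a 25-30 minute run. In this sketch long_run is a hypothetical placeholder for a breseq command, not part of breseq, and the log file names are arbitrary. The Ctrl + z route needs an interactive terminal, so it is not shown here.

```shell
# Stand-in for a long-running command (hypothetical; replace with a real one).
long_run() { sleep 1; echo "finished $1"; }

long_run filtered   > filtered.log   2>&1 &   # trailing & returns the prompt at once
long_run unfiltered > unfiltered.log 2>&1 &   # second run starts while the first is going

jobs   # list the background jobs started from this shell
wait   # block until every background job has exited
cat filtered.log unfiltered.log
```

Redirecting each run to its own log file keeps the output of the two background jobs from interleaving on the screen.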
While we wait for these 2 runs to complete, we will go over the sources of some of the errors that pop up in this type of analysis, and things that can be done to try to correct for them...