Single Nucleotide Variant (SNV) calling Tutorial GVA2019
Overview:
SAMtools is a suite of commands for working with files of mapped reads (SAM/BAM format). You'll be using it quite a bit throughout the course. Combined with bcftools, it can also be used for variant calling (mpileup-bcftools).
Learning Objectives
- Familiarize yourself with SAMtools.
- Gain important insight into version control.
- Use SAMtools to identify variants in the E. coli genomes we mapped in the previous tutorial.
Calling variants in reads mapped by bowtie2
Right now, we'll be using SAMtools to call variants (find mutations) in the re-sequenced E. coli genome from the Mapping tutorial. You will need the output SAM files from that tutorial to continue here. If you wish to start this tutorial without completing the Mapping Tutorial, see the bottom section of this page for information about downloading canned data.
We assume that you are still working in the main directory called GVA_bowtie2_mapping that you created on $SCRATCH.
Loading SAMtools – a lesson in version control
One of the most important aspects of science is that it is supposed to be reproducible, and as mentioned in an earlier tutorial, a computer will always do exactly what it is told... the trick is telling it to do what you actually want it to do. As bench scientists, we all know (or will soon learn) that protocols change slightly over time. Maybe you have had the nightmare troubleshooting experience of a reliable protocol suddenly giving unreliable results, only to find out that an enzyme/reagent/kit you bought from a different vendor because it was cheaper is not actually identical in every way. Or maybe you have found a kit or reagent that claims better yield but forces small changes to your protocol. Computational biology is no different: protocols and programs change slightly over time, usually in the form of version updates. In the "best" case, a new version adds functionality without changing the results of old analyses. In the worst case, the developers fix small bugs (increasing accuracy by eliminating false positives, in their eyes at least) in a way that leaves you unaware that anything has changed, apart from a different final output when you have to repeat your analysis (say, because you added new samples to your cohort). Sometimes programs change so drastically that even your old commands stop working. This is both a blessing and a curse. It is a blessing in that you are acutely aware that something has changed and are forced to fix or update your analysis for the new version (typically gaining an understanding of what was changed). It is a curse in that you have to figure out how to fix things, even if that means continuing to use an older version.
As an optional extension of this tutorial, you will have the opportunity to experience this firsthand: you have access to 2 different versions of samtools, one of which works for this tutorial and one of which does not.
First, let's check whether SAMtools is loaded. The easiest way to do this is to simply type samtools. (Remember that most programs/commands are all lowercase, while scripts often have capital letters, even though their webpages use capital letters to make the names stand out.) Looking through the output, you should see a line that reads:
Version: 1.6 (using htslib 1.6)
This is very important information for detailed reporting of your computational analysis and for the reproducibility of that analysis. Sadly, this level of reporting is often ignored or not appreciated by many journals, which makes results difficult to reproduce.
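If you would rather not scan the full usage text by eye, the version line can be filtered out and saved alongside your results. The following is only a minimal sketch: the helper name version_line and the file name samtools_version.txt are made up for illustration, and it assumes that samtools run with no arguments prints its usage text (possibly to stderr, which is why the two streams are merged).

```shell
# Hypothetical helper: keep only the first line that mentions the version.
version_line() {
    grep -m1 'Version:'
}

if command -v samtools >/dev/null 2>&1; then
    # With no arguments, samtools prints its usage text, which includes the
    # Version line. Merge stderr into stdout so the filter sees it either way.
    # tee shows the line on screen and also writes it to a file
    # (samtools_version.txt is an illustrative name, not a course convention).
    samtools 2>&1 | version_line | tee samtools_version.txt
else
    echo "samtools not found; load it with the module system first" >&2
fi
```

Keeping a small file like this next to your output is one simple way to have the exact version on hand when you write up your methods.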
Some modules on TACC offer multiple versions, and sadly (for many biological modules at least) the default version is not always the newest. Can you use the module system to determine whether other versions of samtools are available?