Single Nucleotide Variant (SNV) calling Tutorial GVA2020
Overview:
SAMtools is a suite of commands for dealing with databases of mapped reads. You'll be using it quite a bit throughout the course. It includes programs for performing variant calling (mpileup-bcftools). This tutorial expects you have already completed the Mapping tutorial.
Learning Objectives
- Gain important insight into version control.
- Familiarize yourself with SAMtools.
- Use SAMtools to identify variants in the E. coli genomes we mapped in the previous tutorial.
Loading SAMtools – a lesson in version control
One of the most important aspects of science is that it is supposed to be reproducible, and as mentioned in an earlier tutorial, a computer will always do exactly what it is told... the trick is telling it to do what you actually want it to do. As bench scientists, we all know (or will soon learn) that protocols change slightly over time... maybe you have had the nightmare troubleshooting experience of a reliable protocol suddenly giving unreliable results only to find out that an enzyme/reagent/kit you bought from a different vendor because it was cheeper is actually not identical in every way, or maybe you find a kit or reagent that claims better yield yet forces small differences in your protocol. Computational biology is no different in that protocols and programs change slightly over time (usually in the form of version updates). In the "best" case, version improvements add new functionality that do not change old analysis, in the worst of cases in an effort to fix small bugs (thereby increase accuracy by eliminating false positives in the eyes of the developers at least) in a way that makes you unaware that anything has changed other than your final output if you have to repeat your analysis (say because you added new samples to your cohort). Sometimes, programs will change drastically enough that even your old commands stop working. This is both a blessing and a curse. A blessing in that you are astutely aware that something has changed, and you are forced to either fix/update your analysis to the new version (typically gaining an understanding of what was changed), and a curse in that you have to figure out how to fix things even if this means continuing to use an older version.
As an optional extension of this tutorial you will have the opportunity to experience this first hand as you have access to 2 different versions of samtools 1 of which works for this tutorial, and the other which does not.
First let's check if SAMtools is loaded. The easiest way to do this is to simply type samtools. (Remember that most programs/commands are in all lowercase (while scripts often have capital letters) despite their webpages having capital letters associated with them to make them stand out). Looking through the output you should see a line that reads:
Version: 1.6 (using htslib 1.6)
This is very important information for the most detailed reporting of your computational analysis, and reproducibility of said analysis. Sadly this level of reporting is often ignored or not appreciated by many journals leading to difficulty in reproducing results.
Some modules on TACC offer multiple different versions, and sadly (for many biological modules at least), the default version is not always the newest version. Can you use the module system to determine if there are other versions of samtools available?
You should notice that the only version of samtools that is available through the module system on tacc is version 1.6 and that the module is currently loaded because of what we put in the .bashrc folder in your $HOME directory yesterday. Try to unload the samtools module and see if you have any other versions of samtools available in other locations.
Somewhat unfortunately, there is no output given when a module is unloaded, even if you try to unload something that doesn't exist, or wasn't loaded currently:
Welcome to the University Wiki Service! Please use your IID (yourEID@eid.utexas.edu) when prompted for your email address during login or click here to enter your EID. If you are experiencing any issues loading content on pages, please try these steps to clear your browser cache. If you require further assistance, please email wikihelp@utexas.edu.