Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

  1. Familiarize yourself with SAMtools.
  2. Gain important insight into version control.
  3. Use SAMtools to identify variants in the E. coli genomes we mapped in the previous tutorial.

...

We assume that you are still working in the main directory called BDIB_bowtie2_mapping that you created on $SCRATCH.

Load SAMtools

Check if SAMtools is loaded and if not Load the SAMtools module.

...

titleClick here for a hint without the answer

...

Loading SAMtools – a lesson in version control

One of the most important aspects of science is that it is supposed to be reproducible, and as mentioned in an earlier tutorial, a computer will always do exactly what it is told... the trick is telling it to do what you actually want it to do. As bench scientists, we all know (or will soon learn) that protocols change slightly over time... maybe you have had the nightmare troubleshooting experience of a reliable protocol suddenly giving unreliable results only to find out that an enzyme/reagent/kit you bought from a different vendor because it was cheeper is actually not identical in every way, or maybe you find a kit or reagent that claims better yield yet forces small differences in your protocol. Computational biology is no different in that protocols and programs change slightly over time (usually in the form of version updates). In the "best" case, version improvements add new functionality that do not change old analysis, in the worst of cases in an effort to fix small bugs (thereby increase accuracy by eliminating false positives in the eyes of the developers at least) in a way that makes you unaware that anything has changed other than your final output if you have to repeat your analysis (say because you added new samples to your cohort). Sometimes, programs will change drastically enough that even your old commands stop working. This is both a blessing and a curse. A blessing in that you are astutely aware that something has changed, and you are forced to either fix/update your analysis to the new version (typically gaining an understanding of what was changed), and a curse in that you have to figure out how to fix things even if this means continuing to use an older version.

As an optional extension of this tutorial you will have the opportunity to experience this first hand as you  have access to 2 different versions of samtools 1 of which works for this tutorial, and the other which does not.

First let's check if SAMtools is loaded. The easiest way to do this is to simply type samtools. (Remember that most programs/commands are in all lowercase (while scripts often have capital letters) despite their webpages having capital letters associated with them to make them stand out). Looking through the output you should see a line that reads:

No Format
Version: 0.1.18 (r982:295)

This is very important information for the most detailed reporting of your computational analysis, and reproducibility of said analysis. Sadly this level of reporting is often ignored or not appreciated by many journals leading to difficulty in reproducing results.

Some modules on TACC offer multiple different versions, and sadly (for many biological modules at least), the default version is not always the newest version. Can you use the module system to determine if there are other versions of samtools available?

Expand
titleClick here for a hint without the answer

Remember we use the the base command "module" to list the installed modules, and find the available modules, and then spider to access more detailed information (including what versions are available).

Code Block
languagebash
titleclick here to check your work, or get the answer
collapsetrue
module list samtools
module avail samtools
module spider samtools

You should notice that the only version of samtools that is available through the module system on tacc is version 1.3 which is clearly different than version 0.1.18. Can you figure out where version 0.1.18 is being loaded from?

Expand
titleNeed a hint?
  1. The which command searches through your path for an executable matching whatever comes after it, and returns the location that command is stored.
  2. Tab (and double tab) will work with the which command
Expand
titleAnswer:
No Format
nopaneltrue
tacc:~$ which samtools 
/corral-repl/utexas/BioITeam/breseq/bin/samtools

Hopefully, you recognize this as part of the BioITeam community resources.

You may have noticed that module list samtools told you that you had version 1.3 loaded already. This raises the question of why samtools points at version 0.1.18 in the BioITeam resources rather than the module version. To understand this, consider the following information about the module system and the $PATH variable:

  1. When you type a command, only locations that are in your PATH variable are searched for an executable command matching that name.
  2. When the command is found in your PATH, the computer immediately substitutes the location it was found for the short command name you entered, and stops searching.
    1. This means that things that are early in your path are always searched first. In some extreme cricumstances if you add a bunch of locations with lots of files to the front of your PATH, you can actually slow down your entire computer. 
  3. The module system always assumes that when you load a module, you intend to use it, and thus puts the executables for that module at the front of your PATH.
  4. In your .bashrc file, modules are loaded first (including samtools). 
  5. After modules are loaded, we further manipulate your PATH variable several times. The last time with the following line:
    1. No Format
      export PATH=$BI/breseq/bin:$PATH
    2. This command says make the variable PATH equal to the variable BI plus /breseq/bin and then add on the existing value of $PATH
    3. Warning
      titleOne of the most important lessons you can ever learn

      anytime you manipulate your PATH variable you always want to make sure that you include $PATH on the right side of the equation somewhere separated by : either before it, after it, or on both sides of it if you want it in the middle of 2 different locations. As we are explaining right now, there are reasons and times to put it in different relative places, but if you fail to include it (or include a typo in it by calling it say $PTAH) you can actually remove access to all existing commands (including the most basic things like "ls" "mkdir" "cd".

  6. Together all this means that the last modification to your PATH was the addition of a different program (breseq) to your PATH, unexpectedly breseq also includes a samtools distribution making it the first thing that is found when you use the command samtools. This is not an uncommon type of thing for programs to do when they rely on a specific version of a program to run themselves.
Expand
titleclick here for instructions on how to test/verify that this is true

Simply reload samtools using the module system, check the version, and where that version is now being called from.

Code Block
languagebash
titleclick here to check your work, or get the answerstill stuck?
collapsetrue
module load listsamtools
samtools  # check moduleoutput availfor samtoolsversion moduleinformation
loadwhich samtools

 

expand
samtools
Warning
Code Block
languagebash
titleThis should work:
collapsetrue
title
Is there any harm in typing module load *** if *** is already loaded?

No, there is no harm. In some odd circumstances it seems required to make things work exactly as intended. Similar to the "logout and log back in" trick we've used for TACC, and the older "restart your computer" tricks, module load **** can be used to see if something with a module is not behaving as intended.

Can you figure out what version of samtools is loaded on TACC and where it is installed?

As alluded to in the introduction, this tutorial is designed to run (and will actually only run) with one of these 2 versions. That version is the version that is included with breseq in the BioITeam distrubtion. At the end of this tutorial there is an optional tutorial using the module version of samtools, but for now...

execute the following 2 commands and make sure you get the 3rd line as output:

No Format
tacc:~$ module unload samtools
tacc:~$ which samtools
/corral-repl/utexas/BioITeam/breseq/bin/samtools

Prepare your directories

Since the $SCRATCH directory on lonestar is effectively infinite for our purposes, we're going to copy the relevant files from our mapping tutorial into a new directory for this tutorial. This should help you identify what files came from what tutorial if you look back at it in the future. Let's copy over just the read alignment file in the SAM format and the reference genome in FASTA format to a new directory called BDIB_samtools_tutorial.

...