Single Nucleotide Variant (SNV) calling Tutorial GVA2016

 

Overview:

SAMtools is a suite of commands for dealing with databases of mapped reads. You'll be using it quite a bit throughout the course. It includes programs for performing variant calling (mpileup-bcftools).

Learning Objectives

  1. Familiarize yourself with SAMtools.
  2. Gain important insight into version control.
  3. Use SAMtools to identify variants in the E. coli genomes we mapped in the previous tutorial.

Calling variants in reads mapped by bowtie2

Right now, we'll be using it to call variants (find mutations) in the re-sequenced E. coli genome from the Mapping tutorial. You will need the output SAM files from that tutorial to continue here. If you wish to start this tutorial without completing the Mapping Tutorial, see the bottom section of this page for information about downloading canned data.

We assume that you are still working in the main directory called BDIB_bowtie2_mapping that you created on $SCRATCH.

Loading SAMtools – a lesson in version control

One of the most important aspects of science is that it is supposed to be reproducible, and as mentioned in an earlier tutorial, a computer will always do exactly what it is told... the trick is telling it to do what you actually want it to do. As bench scientists, we all know (or will soon learn) that protocols change slightly over time... maybe you have had the nightmare troubleshooting experience of a reliable protocol suddenly giving unreliable results only to find out that an enzyme/reagent/kit you bought from a different vendor because it was cheeper is actually not identical in every way, or maybe you find a kit or reagent that claims better yield yet forces small differences in your protocol. Computational biology is no different in that protocols and programs change slightly over time (usually in the form of version updates). In the "best" case, version improvements add new functionality that do not change old analysis, in the worst of cases in an effort to fix small bugs (thereby increase accuracy by eliminating false positives in the eyes of the developers at least) in a way that makes you unaware that anything has changed other than your final output if you have to repeat your analysis (say because you added new samples to your cohort). Sometimes, programs will change drastically enough that even your old commands stop working. This is both a blessing and a curse. A blessing in that you are astutely aware that something has changed, and you are forced to either fix/update your analysis to the new version (typically gaining an understanding of what was changed), and a curse in that you have to figure out how to fix things even if this means continuing to use an older version.

As an optional extension of this tutorial you will have the opportunity to experience this first hand as you  have access to 2 different versions of samtools 1 of which works for this tutorial, and the other which does not.

First let's check if SAMtools is loaded. The easiest way to do this is to simply type samtools. (Remember that most programs/commands are in all lowercase (while scripts often have capital letters) despite their webpages having capital letters associated with them to make them stand out). Looking through the output you should see a line that reads:

Version: 0.1.18 (r982:295)

This is very important information for the most detailed reporting of your computational analysis, and reproducibility of said analysis. Sadly this level of reporting is often ignored or not appreciated by many journals leading to difficulty in reproducing results.

Some modules on TACC offer multiple different versions, and sadly (for many biological modules at least), the default version is not always the newest version. Can you use the module system to determine if there are other versions of samtools available?

 Click here for a hint without the answer

Remember we use the the base command "module" to list the installed modules, and find the available modules, and then spider to access more detailed information (including what versions are available).

click here to check your work, or get the answer
module list samtools
module avail samtools
module spider samtools

You should notice that the only version of samtools that is available through the module system on tacc is version 1.3 which is clearly different than version 0.1.18. Can you figure out where version 0.1.18 is being loaded from?

 Need a hint?
  1. The which command searches through your path for an executable matching whatever comes after it, and returns the location that command is stored.
  2. Tab (and double tab) will work with the which command
 Answer:
tacc:~$ which samtools 
/corral-repl/utexas/BioITeam/breseq/bin/samtools

Hopefully, you recognize this as part of the BioITeam community resources.

You may have noticed that module list samtools told you that you had version 1.3 loaded already. This raises the question of why samtools points at version 0.1.18 in the BioITeam resources rather than the module version. To understand this, consider the following information about the module system and the $PATH variable:

  1. When you type a command, only locations that are in your PATH variable are searched for an executable command matching that name.
  2. When the command is found in your PATH, the computer immediately substitutes the location it was found for the short command name you entered, and stops searching.
    1. This means that things that are early in your path are always searched first. In some extreme cricumstances if you add a bunch of locations with lots of files to the front of your PATH, you can actually slow down your entire computer. 
  3. The module system always assumes that when you load a module, you intend to use it, and thus puts the executables for that module at the front of your PATH.
  4. In your .bashrc file, modules are loaded first (including samtools). 
  5. After modules are loaded, we further manipulate your PATH variable several times. The last time with the following line:
    1. export PATH=$BI/breseq/bin:$PATH
    2. This command says make the variable PATH equal to the variable BI plus /breseq/bin and then add on the existing value of $PATH
    3. One of the most important lessons you can ever learn

      anytime you manipulate your PATH variable you always want to make sure that you include $PATH on the right side of the equation somewhere separated by : either before it, after it, or on both sides of it if you want it in the middle of 2 different locations. As we are explaining right now, there are reasons and times to put it in different relative places, but if you fail to include it (or include a typo in it by calling it say $PTAH) you can actually remove access to all existing commands (including the most basic things like "ls" "mkdir" "cd".

  6. Together all this means that the last modification to your PATH was the addition of a different program (breseq) to your PATH, unexpectedly breseq also includes a samtools distribution making it the first thing that is found when you use the command samtools. This is not an uncommon type of thing for programs to do when they rely on a specific version of a program to run themselves.
 click here for instructions on how to test/verify that this is true

Simply reload samtools using the module system, check the version, and where that version is now being called from.

still stuck?
module load samtools
samtools  # check output for version information
which samtools

As alluded to in the introduction, this tutorial is designed to run (and will actually only run) with one of these 2 versions. That version is the version that is included with breseq in the BioITeam distrubtion. At the end of this tutorial there is an optional tutorial using the module version of samtools, but for now...

execute the following 2 commands and make sure you get the 3rd line as output:

tacc:~$ module unload samtools
tacc:~$ which samtools
/corral-repl/utexas/BioITeam/breseq/bin/samtools

Prepare your directories

Since the $SCRATCH directory on lonestar is effectively infinite for our purposes, we're going to copy the relevant files from our mapping tutorial into a new directory for this tutorial. This should help you identify what files came from what tutorial if you look back at it in the future. Let's copy over just the read alignment file in the SAM format and the reference genome in FASTA format to a new directory called BDIB_samtools_tutorial.

Check your work or get the answer here
cds
mkdir BDIB_samtools_tutorial
cd BDIB_samtools_tutorial
cp $SCRATCH/BDIB_bowtie2_mapping/bowtie2/SRR030257.sam .
cp $SCRATCH/BDIB_bowtie2_mapping/NC_012967.1.fasta .

Index the FASTA reference file

First, you need to index the reference file. (This isn't indexing it for read mapping. It's indexing it so that SAMtools can quickly jump to a certain base in the reference.)

Command to index the reference file for SAMtools
samtools faidx NC_012967.1.fasta

Take a look at the new *.fai file that was created by this command see if you have any idea what some of the numbers mean. 

Alternative to head/tail/cat for examining a file without causing programs to crash
less NC_012967.1.fasta.fai  # can exit with "q"

As you can see, the less command also works perfectly well with files that are not in danger of crashing anything without cluttering your terminal  with lines of a file.

Convert mapped reads from SAM to BAM, sort, and index

SAM is a text file, so it is slow to access information about how any given read was mapped. SAMtools and many of the commands that we will run later work on BAM files (essentially GZIP compressed binary forms of the text SAM files). These can be loaded much more quickly. Typically, they also need to be sorted, so that when the program wants to look at all reads overlapping position 4,129,888, it can easily find them all at once without having to search through the entire BAM file.

The following 3 commands are used to convert from SAM to BAM format, sort the BAM file, and index the BAM file. As you might assume this is computationally intense and as such must be iDEV node or submitted as a job (more on this on Thursday). If you want to submit them to the job queue, you will want to separate them with a ";" to ensure that they run sequentially on the same node. Under no circumstances should you run this on the head node.

Do not run on head node

Use showq -u to verify you are still on the idev node.

Use this command to restart an idev session if you are not on one
idev  -m 120 -r CCBB_Bio_Summer_School_2016_Day2 -A UT-2015-05-18 -N 2 -n 8
Commands to be executed in order...
samtools view -b -S -o SRR030257.bam SRR030257.sam
samtools sort SRR030257.bam SRR030257.sorted
samtools index SRR030257.sorted.bam
This is a really common sequence of commands, so you might want to add it to your personal cheat sheet.

 

Examine the output of the previous commands to get an idea of whats going on. Here are some prompts of how to do that:

  •  What new files were created by these commands?
    List the contents of the output directory
    ls
    
    Expected output
    NC_012967.1.fasta      SRR030257.sorted.bam.bai
    NC_012967.1.fasta.fai  SRR030257.sam
    SRR030257.bam          SRR030257.sorted.bam
    
  •  Why didn't we name the output SRR030257.sorted.bam in the samtools sort command?

    Samtools appends an extra .bam to whatever we put here, so it would have created SRR030257.sorted.bam.bam, and then we would have had to make a joke about the Flintstones.

  •  Can you guess what a *.bai file is?

    Sure enough, it's the index file for the BAM file.

You might be tempted to gzip BAM files when copying them from one computer to another. Don't bother! They are already internally compressed, so you won't be able to shrink the file. On the other hand, compressing SAM files will save a fair bit of space.

Call genome variants

Now we use the mpileup command from samtools to compile information about the bases mapped to each reference position. The output is a BCF file. This is a binary form of the text Variant Call Format (VCF).

This is one command. Put it all on one line.
samtools mpileup -u -f NC_012967.1.fasta SRR030257.sorted.bam > SRR030257.bcf

What are all the options doing? Try calling samtools mpileup without any options to see if you can figure it out before clicking below to  

 Click here to expand...
Optionpurpose
-ugenerates uncompressed BCF output
-f NC_012967.1.fasta.fai

faidx indexed reference sequence file

SRR030257.sorted.bamBAM input file to calcluate pileups from
> SRR030257.bcfDirect output to SRR030257.bcf

 

The samtools mpileup command will take a few minutes to run. As practice for a fairly common occurrence when working with the iDEV environment, once the command is running, you should try putting it in the background by pressing control-z and then typing the command bg so that you can do some other things in this terminal window at the same time. Remember, there are still many other processors available on this node!

Once the mpileup command is complete, convert the BCF file to a "human-readable" VCF file using the bcftools command (the which command will tell you where this command is located and examination of that path should tell you how you have access to it).

This is one command. Put it all on one line.
bcftools view -v -c -g SRR030257.bcf > SRR030257.vcf

What are these options doing?

 See if you can start with the base command "bcftools" and figure out what each option is doing. Click here when you think you know.
Optionpurpose
viewspecific command to be executed by bcftools
-v

output potential variant sites only

-cSNP calling
-gcall genotypes at variant sites
SRR030257.bcf
input bcf file
> SRR030257.vcf
output as a vcf file

Take a look at the SRR030257.vcf file using less. It has a nice header explaining what the columns mean, including answers to some of your questions from yesterday's presentations. Below this are the rows of data describing potential genetic variants.

Filtering VCF files with grep

VCF format has alternative Allele Frequency tags denoted by AF= Try the following command to see what values we have in our files.

grep AF1 SRR030257.vcf
 Optional: For the data we are dealing with, predictions with an allele frequency not equal to 1 are not really applicable. (The reference genome is haploid. There aren't any heterozygotes.) How can we remove these lines from the file?

Try looking at grep --help to see what you can come up with.

Here for answer
grep -v *something*  # The -v flag inverts the match effecitvely showing you everything that does not match your input
 Going farther
cat input.vcf | grep AF1=1 > output.vcf

Is not practical, since we will lose vital VCF formatting and may not be able to use this file in the future.

cat input.vcf | grep -v AF1=0 > output.vcf

Will preserve all lines that don't have a AF=0 value and is one way of doing this.

sed -i '/AF1=0/ d' input.vcf

Is a way of doing it in-line and not requiring you to make another file. (But it writes over your existing file!)

Optional Exercises

Calling variants in reads mapped by BWA or the improved read quality reads

Follow the same directions to call variants in the BWA or Bowtie2 mapped reads with the improved quality. Just be sure you don't write over your old files. Maybe create new directories like BDIB_samtools_bwa and BDIB_samtools_bowtie_improved for the output in each case. As another fun version control issue, using the module system you can see that bowtie actually has 2 versions available: 1.1.2 and 2.2.6 ... these version could not be more different to the point where they should really be named different things. bowtie 1.1.2 is included only to keep old scripts that rely on it running, and is surpassed in everyway by bowtie2. If so desired, you could figure out how to run it yourself as a test of learning an unfamiliar program, or you could include any the output from said run from the precanned location below:

If you do not have the output from the Mapping tutorial, run the first 4 commands to copy over the output that would have been produced. Then, you can immediately start this tutorial!

cds
mkdir BDIB_samtools_tutorial  # if directory already exists, don't worry it won't delete the current contents
cd BDIB_samtools_tutorial
cp -r $BI/gva_course/mapping/bowtie2 .
 
# optional commands for optional and more in-depth tutorials
cp -r $BI/gva_course/mapping/bowtie .
cp -r $BI/gva_course/mapping/bwa . 

These precanned results will be used in the optional upcoming bedtools tutorial as well, or you can simply compare the output .vcf files for more simple answers

Further Optional Exercises

  • Which mapper finds more variants?
  • Can you figure out how to filter the VCF files on various criteria, like coverage, quality, ... ?
  • How many high quality mutations are there in these E. coli samples relative to the reference genome?
  • Look at how the reads supporting these variants were aligned to the reference genome in the Integrative Genomics Viewer (IGV).
  • Look into more sophisticated variant calling with GATK. We recommend starting from the GATK best practice page.

Module  version of samtools

As suggested in the initial introduction, the point of this optional tutorial is to work through getting a different version of samtools to work (the command line expectations and flags are not as they were in version 0.1.18). To make sure you are starting in the right place:

tacc:~$ module load samtools
tacc:~$ which samtools
/opt/apps/samtools/1.3/bin/samtools

Good luck, and remember if you undertake this and get frustrated with it, it is a great learning experience and is by far the most difficult thing you will attempt. As part of the learning experience, feel free to contact us with any questions or problems you are specifically having with it, but cookbooked suggestions would defeat the intended purpose of beating your head against the problem to figure it out. You DO have the necessary skills to figure out how to do this now.

Return to GVA2016 course page.