Single Nucleotide Variant (SNV) calling Tutorial GVA2020


Overview:

SAMtools is a suite of commands for dealing with databases of mapped reads. You'll be using it quite a bit throughout the course. It includes programs for performing variant calling (mpileup-bcftools). This tutorial expects you have already completed the Mapping tutorial.

Learning Objectives

  1. Gain important insight into version control.
  2. Familiarize yourself with SAMtools.
  3. Use SAMtools to identify variants in the E. coli genomes we mapped in the previous tutorial.

Loading SAMtools – a lesson in version control

One of the most important aspects of science is that it is supposed to be reproducible, and as mentioned in an earlier tutorial, a computer will always do exactly what it is told... the trick is telling it to do what you actually want it to do. As bench scientists, we all know (or will soon learn) that protocols change slightly over time... maybe you have had the nightmare troubleshooting experience of a reliable protocol suddenly giving unreliable results only to find out that an enzyme/reagent/kit you bought from a different vendor because it was cheeper is actually not identical in every way, or maybe you find a kit or reagent that claims better yield yet forces small differences in your protocol. Computational biology is no different in that protocols and programs change slightly over time (usually in the form of version updates). In the "best" case, version improvements add new functionality that do not change old analysis, in the worst of cases in an effort to fix small bugs (thereby increase accuracy by eliminating false positives in the eyes of the developers at least) in a way that makes you unaware that anything has changed other than your final output if you have to repeat your analysis (say because you added new samples to your cohort). Sometimes, programs will change drastically enough that even your old commands stop working. This is both a blessing and a curse. A blessing in that you are astutely aware that something has changed, and you are forced to either fix/update your analysis to the new version (typically gaining an understanding of what was changed), and a curse in that you have to figure out how to fix things even if this means continuing to use an older version.

As an optional extension of this tutorial you will have the opportunity to experience this first hand as you  have access to 2 different versions of samtools 1 of which works for this tutorial, and the other which does not.

First let's check if SAMtools is loaded. The easiest way to do this is to simply type samtools. (Remember that most programs/commands are in all lowercase (while scripts often have capital letters) despite their webpages having capital letters associated with them to make them stand out). Looking through the output you should see a line that reads:

Version: 1.6 (using htslib 1.6)

This is very important information for the most detailed reporting of your computational analysis, and reproducibility of said analysis. Sadly this level of reporting is often ignored or not appreciated by many journals leading to difficulty in reproducing results.

Some modules on TACC offer multiple different versions, and sadly (for many biological modules at least), the default version is not always the newest version. Can you use the module system to determine if there are other versions of samtools available?

 Click here for a hint without the answer

Remember we use the the base command "module" to list the installed modules, and find the available modules, and then spider to access more detailed information (including what versions are available).

click here to check your work, or get the answer
module list samtools
module avail samtools
module spider samtools

As you have used the module system several times now, you probably should have been at least in the right ballpark with the hint if not sooner.

You should notice that the only version of samtools that is available through the module system on tacc is version 1.6 and that the module is currently loaded because of what we put in the .bashrc folder in your $HOME directory yesterday. Try to unload the samtools module and see if you have any other versions of samtools available in other locations.

 Need a hint?
  1. The module command has several subcommands which can be found by typing module help.
  2. unload is of course the opposite of load
click here for the answer
module unload samtools


Somewhat unfortunately, there is no output given when a module is unloaded, even if you try to unload something that doesn't exist, or wasn't loaded currently:

Output for unloading modules
tacc:~$ module unload not-a_real_module
tacc:~$ module unload samtools
tacc:~$

Now if you reuse the module list samtools command you will get a response saying that there are no modules matching samtools currently loaded. Yet what happens when you again try the samtools command without any other information? You clearly see a different version 0.1.18. Can you figure out where version 0.1.18 is being loaded from?

 Need a hint?
  1. The which command searches through your path for an executable matching whatever comes after it, and returns the location that command is stored.
  2. Tab (and double tab) will work with the which command
 Answer:
tacc:~$ which samtools 
/corral-repl/utexas/BioITeam/breseq/bin/samtools

Hopefully, you recognize this as part of the BioITeam community resources.

So why is it when the module is loaded it finds the 1.6 version rather than the 0.1.18 version? The answer is thats what the $PATH variable is set to find. Consider the following information about the module system and the $PATH variable:

  1. When you type a command, only locations that are in your PATH variable are searched for an executable command matching that name.
  2. When the command is found in your PATH, the computer immediately substitutes the location it was found for the short command name you entered, and stops searching.
    1. This means that things that are early in your path are always searched first. In some extreme circumstances if you add a bunch of locations with lots of files to the front of your PATH, you can actually slow down your entire computer, so try to limit the path variable to only look in directories containing executable files.
  3. The module system always assumes that when you load a module, you intend to use it, and thus puts the executables for that module at the front of your PATH. This is one of the reasons we try to limit the number of modules we load by default (it puts other commands further back in the list). 
  4. In your .bashrc file, modules are loaded first (including samtools). 
  5. After modules are loaded, we further manipulate your PATH variable several times. The last section involving breseq has 2 alternative manipulations:
    1. The first which you can see we have commented out:

      # export PATH=$BI/breseq/bin:$PATH
      1. This command says make the variable PATH equal to the variable BI plus /breseq/bin and then add on the existing value of $PATH
    2. The second we actually use.

      export PATH=$PATH:$BI/breseq/bin
      1. This command says make the variable PATH equal to the existing value of $PATH variable and then add on BI plus /breseq/bin 
  6. One of the most important lessons you can ever learn

    Anytime you manipulate your PATH variable you always want to make sure that you include $PATH on the right side of the equation somewhere separated by : either before it, after it, or on both sides of it if you want it in the middle of 2 different locations. As we are explaining right now, there are reasons and times to put it in different relative places, but if you fail to include it (or include a typo in it by calling it say $PTAH) you can actually remove access to all existing commands including the most basic things like "ls" "mkdir" "cd".

 Reload the module version of samtools which is the correct one for the tutorial.

Simply reload samtools using the module system, check the version, and which version is now being used.

still stuck?
module load samtools
samtools  # check output for version information
which samtools
 
#required output:
/opt/apps/intel18/samtools/1.6/bin/samtools

If this doesn't seem familiar or make sense, get my attention on zoom.

As alluded to in the introduction, this tutorial is designed to run (and will actually only run) with one of these 2 versions. That version is the module version. At the end of this tutorial there is an optional suggestion to try to use the bioITeam version of samtools, but for now...

execute the following command and make sure you get the 2nd line as output:

tacc:~$ which samtools
/opt/apps/intel18/samtools/1.6/bin/samtools

If you see something different get my attention or the tutorial will not work.

Doing more with which

A lot of text was just devoted to the $PATH variable and how its manipulated and how to investigate it in a few different ways. This is because PATH variables are fairly common, especially when if start experimenting with multiple different programs that may have similar underlying requirements.

which has its own -h option that can be used to see what other potentially useful options it has
which -a samtools

When you have multiple results, the top line is the line that will be used unless you specify the entire path to the command. This can be useful in diagnosing what is going wrong particularly when you have an error message that says a command didn't work, and which -a tells you you have multiple different instances available.

Calling variants in reads mapped by bowtie2

Prepare your directories

Since the $SCRATCH directory on lonestar is effectively infinite for our purposes, we're going to copy the relevant files from our mapping tutorial into a new directory for this tutorial. This should help you identify what files came from what tutorial if you look back at it in the future. Let's copy over just the read alignment file in the SAM format and the reference genome in FASTA format to a new directory called GVA_samtools_tutorial.

Check your work or get the answer here
cds
mkdir GVA_samtools_tutorial
cd GVA_samtools_tutorial
cp $SCRATCH/GVA_bowtie2_mapping/bowtie2/SRR030257.sam .
cp $SCRATCH/GVA_bowtie2_mapping/NC_012967.1.fasta .

Unexpected output when you try to copy final files from mapping tutorial

If you see messages saying something similar to the following:

cp: cannot stat '/scratch/01821/ded/GVA_bowtie2_mapping/bowtie2/SRR030257.sam': No such file or directory
cp: cannot stat '/scratch/01821/ded/GVA_bowtie2_mapping/NC_012967.1.fasta': No such file or directory

It suggests something you either did not yet complete the mapping tutorial, or more likely, you stored these files in a different directory. If you think you completed the mapping tutorial, get my attention and be ready to share your screen and I'll try to help you find your missing files.

When copy commands execute successfully, the expected output is silent (no output at all)

Index the FASTA reference file

First, you need to index the reference file. (This isn't the same as indexing it for read mapping. It's indexing it so that SAMtools can quickly jump to a certain base in the reference.)

Command to index the reference file for SAMtools
samtools faidx NC_012967.1.fasta

Take a look at the new *.fai file that was created by this command see if you have any idea what some of the numbers mean. 

Alternative to head/tail/cat for examining a file without causing programs to crash
less NC_012967.1.fasta.fai  # can exit with "q"

As you can see, the less command also works perfectly well with files that are not in danger of crashing anything without cluttering your terminal  with lines of a file.

Convert mapped reads from SAM to BAM, sort, and index

SAM is a text file, so it is slow to access information about how any given read was mapped. SAMtools and many of the commands that we will run later work on BAM files (essentially GZIP compressed binary forms of the text SAM files). These can be loaded much more quickly. Typically, they also need to be sorted, so that when the program wants to look at all reads overlapping position 4,129,888, it can easily find them all at once without having to search through the entire BAM file.

The following 3 commands are used to:

  1. convert from SAM to BAM format
  2. sort the BAM file
  3. index the sorted BAM file

As you might guess this is computationally intense and as such must be iDEV node or submitted as a job (more on this on Friday). If you want to submit them to the job queue, you will want to separate them with a ";" to ensure that they run sequentially rather than simultaneously as each uses the output of the previous command. Under no circumstances should you run this on the head node.

Do not run on head node

Use showq -u to verify you are still on the idev node.

If not, and you need help getting a new idev node, see this tutorial.

Commands to be executed in order...
samtools view -b -S -o SRR030257.bam SRR030257.sam
samtools sort SRR030257.bam -o SRR030257.sorted.bam
samtools index SRR030257.sorted.bam
This is a really common sequence of commands, so you might want to add it to your personal cheat sheet.


Examine the output of the previous commands to get an idea of whats going on. Here are some prompts of how to do that:

List the contents of the output directory to see what new files have been created
ls -1  # This is the number 1 not the letter L. Several people seem to get confused about this each year, it is not necessary for any reason other than to give a single column of output regardless of window size.
Expected output
NC_012967.1.fasta
NC_012967.1.fasta.fai
SRR030257.bam
SRR030257.sam
SRR030257.sorted.bam
SRR030257.sorted.bam.bai
 Given what we saw above of fasta index file gaining a .fai ending, can you guess what a *.bai file is?

Sure enough, it's the index file for the BAM file.

You might be tempted to gzip BAM files when copying them from one computer to another. Don't bother! They are already internally compressed, so you won't be able to shrink the file. Further, to the best of my knowledge, no programs accept a gzipped bam file as a format to use.

On the other hand, compressing SAM files will save a fair bit of space, but the conversion between bam back to sam is pretty simple. Making storage of bam files likely a better decision.

Call genome variants

Now we use the mpileup command from samtools to compile information about the bases mapped to each reference position. The output is a BCF file. This is a binary form of the text Variant Call Format (VCF).

This is one command. Put it all on one line.
samtools mpileup -u -f NC_012967.1.fasta SRR030257.sorted.bam > SRR030257.bcf
 What are all the options doing? Try calling samtools mpileup without any options to see if you can figure it out before clicking for the answer
Optionpurpose
-ugenerates uncompressed BCF output
-f NC_012967.1.fasta

reference sequence file that has a corresponding faidx index .fai file

SRR030257.sorted.bamBAM input file to calculate pileups from
> SRR030257.bcfDirect output to SRR030257.bcf


The samtools mpileup command will take a few minutes to run. As practice for a fairly common occurrence when working with the iDEV environment, once the command is running, you could try putting it in the background by pressing control-z and then typing the command bg so that you can do some other things in this terminal window at the same time. Remember, there are still many other processors available on this node for you to do other things! Just remember that if you have something running in the background you need to check in with it to see if it is still running with the ps command or watch the command line every time you execute a new command as you may see information about your background task having finished.

Convert genome variants to human readable format

Once the mpileup command is complete, convert the BCF file to a "human-readable" VCF file using the bcftools command (the which command will tell you where this command is located and examination of that path should tell you how you have access to it).

Use the which command to determine when we installed bcftools and then convert the .bcf file into the human readable .vcf version
which bcftools
bcftools call -v -c SRR030257.bcf > SRR030257.vcf
 How did we install bcftools?

which bcftools returns the following line:

/opt/apps/intel18/samtools/1.6/bin/bcftools

You may recognize from above that it is the same directory samtools is found in when you load samtools with the module system.


What are these options doing?

 See if you can start with the base command "bcftools" and figure out what each option above is doing on the bcftools command we used above. Click here when you think you know.
Optionpurpose
callspecific command to be executed by bcftools for calling SNP and indels
-v

output potential variant sites only

-cconsensus calling
SRR030257.bcf
input bcf file
> SRR030257.vcf
output as a vcf file

Take a look at the SRR030257.vcf file using less. It has a nice header explaining what the columns mean, including answers to some of your questions from yesterday's presentations. Below this are the rows of data describing potential genetic variants.

Analyzing variants detected

VCF format has alternative Allele Frequency tags denoted by AF= Try the following command to see what frequency our variants exist at.

grep AF1 SRR030257.vcf

If you look at the AF1= values you will all the lines are either ~ 0.5, or 1.

You can see that even easier with this handy awk 1 liner
 awk -F";" '{for(i=1;i<=NF;i++){if ($i ~ /AF1/){print $i}}}' SRR030257.vcf

^splits each line into columns based on where the ";"s, then searches through each column, if the "AF1" is found in the column, that column is printed. From the output it is even clearer that frequencies are coming up.

This is a pretty useful awk 1 liner to keep track of for pulling a single column out when you know it will have an identifier but not what column it will occur in; just substitute what you expect to find for "AF1" and whatever you want to use to break the line up for ";" (with \t for tab and , being the 2 most common options for tsv and csv files respectively).
The above result can be piped into sort to sort the lines into increasing order
 awk -F";" '{for(i=1;i<=NF;i++){if ($i ~ /AF1/){print $i}}}' SRR030257.vcf | sort
 Did you see anything when you were running the commands in this tutorial to suggest that it was looking for diploid genomes?

After you ran the bcftools call command you saw: "Note: none of --samples-file, --ploidy or --ploidy-file given, assuming all sites are diploid". Just like on a webpage you can use control/command + F to find specific text in the window. Lookf for 'diploid' and you should see the line referenced above.

Obviously this suggests a way that you could go back and reanalyze this data introducing one of the recommended flags to the bcftools call command and see how this might effect your analysis. If you choose to do this, I suggest adding descriptive file name between 'SRR030257' and '.vcf' to make the results easier to compare.

 Rather than (or in addition to) going back and changing the analysis to look for variation from our haploid genome, can you come up with a way to create a filtered .vcf file that only has variants with a frequency of 1?

An initial attempt might be something like this:

Note these 2 examples will give the exact same output. The first command gives you extra practice with piping, and may help order your thoughts somewhat.
cat SRR030257.vcf | grep AF1=1 > SRR030257.filtered.vcf


grep AF1=1 SRR030257.vcf > SRR030257.filtered.vcf

This is does give us exactly what we asked for: all the lines that show a variant allele frequency of 1. Unfortunately, we lost all the useful header information at the top of the original  SRR030257.vcf file. 

 Can you come up with a way to include this information possibly by looking at the help information on grep?

From 'grep -h' you can see the that the '-v' inverts the match so it will print everything EXCEPT for lines that match.

2 equivalent answers again
cat SRR030257.vcf | grep -v AF1=0 > SRR030257.filtered.vcf

grep -v AF1=0 SRR030257.vcf > SRR030257.filtered.vcf

Will preserve all lines that don't have 'AF1=0' value on the line and is one way of doing this. If you look closely at the non-filtered file you will see that the frequencies are given as AF1=0.### so by filtering out lines that have 'AF1=0' in them we get rid of all frequencies that are not 1, including say 'AF1=0.99'. 

 How you would change this to variants that have a frequency of at least 90%?

Remember from our Raw Sequencing Data tutorial yesterday that we can group certain characters together by placing them between square brackets [].

2 equivalent answers again
cat SRR030257.vcf | grep -v AF1=0.[0-8] > SRR030257.filtered.vcf

grep -v AF1=0.[0-8] SRR030257.vcf > SRR030257.filtered.vcf

Here we added a decimal point after the 0, and then allowed for a match to any digit between 0 and 8. Thus lines that have AF1=1 would not match, nor would a line with AF1=0.9 .


Return to GVA2020 course page.


Optional Exercises at the end of class or for Wednesday/Thursday choose your own tutorial time.

Calling variants in trimmed reads.

  1. Trim both Read1 and Read2 using info from read preprocessing tutorial.
  2. Map reads with bowtie2 using info from read mapping tutorial.
  3. Call variants using this tutorial.

Remember in the intro tutorial we talked about file/directory naming. Be sure you don't write over your old files. Maybe create a new directories like GVA_samtools_bowtie_improved for the outputs.

Further Optional Exercises

  • Which mapper finds more variants?
  • Can you figure out how to filter the VCF files on various criteria, like coverage, quality, ... ?
  • How many high quality mutations are there in these E. coli samples relative to the reference genome?
  • Look at how the reads supporting these variants were aligned to the reference genome in the Integrative Genomics Viewer (IGV). This will be a separate tutorial for tomorrow.

Other version of samtools

As suggested in the initial introduction, the point of this optional tutorial is to work through getting a different version of samtools to work (the command line expectations, flags, and subcommands (ie bcftools call) were not what they are now in version 0.1.18). To make sure you are starting in the right place:

tacc:~$ module unload samtools
tacc:~$ which samtools
/corral-repl/utexas/BioITeam/breseq/bin/samtools

 
mkdir BDIB_samtools_old_version_tutorial
cd BDIB_samtools_old_version_tutorial
cp $SCRATCH/BDIB_bowtie2_mapping/bowtie2/SRR030257.sam .
cp $SCRATCH/BDIB_bowtie2_mapping/NC_012967.1.fasta .

Good luck, and remember if you undertake this and get frustrated with it, it is a great learning experience and is probably the most difficult exercise/tutorial in this class. As part of the learning experience, feel free to contact me with any questions or problems you are specifically having with it, but cookbooked suggestions would defeat the intended purpose of beating your head against the problem to figure it out. You DO have the necessary skills to figure out how to do this now. 

 A final hint

The 2016 class did this tutorial in the opposite manner where the 0.1.18 version was given as the walkthrough rather than the 1.6 version. I strongly suggest you use that tutorial only as a way to check your answers, but that tutorial can be found here.


Return to GVA2020 course page.