Single Nucleotide Variant (SNV) calling Tutorial GVA2021


Overview:

SAMtools is a suite of commands for dealing with databases of mapped reads. You'll be using it quite a bit throughout the course. It includes programs for performing variant calling (mpileup-bcftools). This tutorial expects you have already completed the Mapping tutorial.

Learning Objectives

  1. Work with a more complex conda installation, and how to troubleshoot it.
  2. Familiarize yourself with SAMtools.
  3. Use SAMtools to identify variants in the E. coli genomes we mapped in the previous tutorial.

Installing SAMtools 

As we have done with: fastqc, cutadapt, and bowtie2, we want to install samtools. Unlike with the previous tools, there is a difficulty this time. If you use and arrive at https://anaconda.org/bioconda/samtools you will easily assume that the correct command you need is: conda install -c bioconda samtools as this has been the command that has worked for our other tools so far. Instead the correct command is actually: conda install -c bioconda samtools bcftools openssl=1.0 

There are 2 different things going on in this command.

  1. Forcing the installation of a specific version of openssl. In this case, a lower version than would normally be installed if samtools were installed by itself. According to https://github.com/bioconda/bioconda-recipes/issues/12100 my understanding is that when the conda package was put together there is an error wherein samtools specifically says to get version 1.1 of openssl, but the samtools program specifically requires version 1.0 to be present.
  2. We are installing both samtools and bcftools at the same time. This can clean up some installation problems when there are conflicts between individual packages and you want to use them in a single environment. An alternative would be to have a samtools environment and a bcftools environment, but that creates unnecessary steps of having to change environments in the middle of your analysis.


 Click here to expand and see what the outcome of the assumed installation command is, what the problem is, and steps you could take to fix it.

This box contains example commands and outputs showing you something that does NOT work for educational and diagnostic purposes. If you use the code listed in this box, be sure you use ALL the code or you may run into downstream problems with this tutorial.


conda install -c bioconda samtools
samtools --version

The above command will appear to install correctly as other programs have, but the second command which you would expect to show you the version of samtools instead returns the following error:

samtools: error while loading shared libraries: libcrypto.so.1.0.0: cannot open shared object file: No such file or directory

Googling the entire error the top results clearly mention conda and several pages list problems associated "fixes" with different conda installation commands.

Some of the "fixes" involve adding both bioconda and conda-forge to your channel list and forcing specific access orders. As noted earlier, adding channels to the default search list is something you can consider, but has the drawback of leading you to potentially install something you didn't mean to making me a fan of being explicit in what channels I'm accessing. More on this on Friday when we discuss what tools to use.

Other "fixes" involve commands that may work, but at the expense of altering existing packages in our environment (conda install -c bioconda samtools=1.9 --force-reinstall) while this command may work based on community feedback, the number of programs that would be downgraded is concerning to me. If I were going to go this route, I would instead copy my existing conda environment to a new "test" environment (conda create --name GVA2021-samtools-test --clone GVA2021), run the struck-through command listed above, and make sure it works as expected before weighing what exact programs were being downgraded and what the immediate effects would be on other analysis. Most likely I would ultimately rename the environment to something more permanent, and keep samtools as a separate step (conda create --name GVA2021-samtools --clone GVA2021-samtools-test; conda env remove --name GVA2021-samtools-test)

Rather than going through all that. My solution is simply to install an older version of samtools deliberately from the start as several webpages (including: https://github.com/bioconda/bioconda-recipes/issues/13958 suggest that the issue is specific to the 1.12 version). In order to do this, I first had to remove the existing (incorrect) samtools version.

conda remove samtools
conda install -c bioconda samtools==1.11
samtools --version

which now gives a reasonable output of:

samtools 1.11
Using htslib 1.11
Copyright (C) 2020 Genome Research Ltd.

Unfortunately, this would then create a downstream problems with installing bcftools and a different set of conflicts. Therefore we will again remove our (now functioning) samtools package, and install samtools, bcftools, and openssl version 1.0 in a single command.

conda remove samtools
conda install -c bioconda samtools bcftools openssl=1.0
Make sure you have the expected versions of samtools and bcftools installed
samtools --version
bcftools --version


samtools --version output:

samtools 1.9
Using htslib 1.9
Copyright (C) 2018 Genome Research Ltd.

bcftools --version output:

bcftools 1.9
Using htslib 1.9
Copyright (C) 2018 Genome Research Ltd.
License Expat: The MIT/Expat license
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

Calling variants in reads mapped by bowtie2

Prepare your directories

Since the $SCRATCH directory on lonestar is effectively infinite for our purposes, we're going to copy the relevant files from our mapping tutorial into a new directory for this tutorial. This should help you identify what files came from what tutorial if you look back at it in the future. Let's copy over just the read alignment file in the SAM format and the reference genome in FASTA format to a new directory called GVA_samtools_tutorial.

Check your work or get the answer here
cds
mkdir GVA_samtools_tutorial
cd GVA_samtools_tutorial
cp $SCRATCH/GVA_bowtie2_mapping/bowtie2/SRR030257.sam .
cp $SCRATCH/GVA_bowtie2_mapping/NC_012967.1.fasta .

Unexpected output when you try to copy final files from mapping tutorial

If you see messages saying something similar to the following:

cp: cannot stat '/scratch/01821/ded/GVA_bowtie2_mapping/bowtie2/SRR030257.sam': No such file or directory
cp: cannot stat '/scratch/01821/ded/GVA_bowtie2_mapping/NC_012967.1.fasta': No such file or directory

It suggests something you either did not yet complete the mapping tutorial, or more likely, you stored these files in a different directory. If you think you completed the mapping tutorial, get my attention and be ready to share your screen and I'll try to help you find your missing files.

When copy commands execute successfully, the expected output is silent (no output at all)

Index the FASTA reference file

Assuming you have the above output for samtools --version and bcftools --version (both 1.9), first, you need to index the reference file. (This isn't the same as indexing it for read mapping. It's indexing it so that SAMtools can quickly jump to a certain base in the reference.)

Command to index the reference file for SAMtools
samtools faidx NC_012967.1.fasta

Take a look at the new *.fai file that was created by this command see if you have any idea what some of the numbers mean. 

Alternative to head/tail/cat for examining a file without causing programs to crash
less NC_012967.1.fasta.fai  # can exit with "q"

As you can see, the less command also works perfectly well with files that are not in danger of crashing anything without cluttering your terminal  with lines of a file.

Convert mapped reads from SAM to BAM, sort, and index

SAM is a text file, so it is slow to access information about how any given read was mapped. SAMtools and many of the commands that we will run later work on BAM files (essentially GZIP compressed binary forms of the text SAM files). These can be loaded much more quickly. Typically, they also need to be sorted, so that when the program wants to look at all reads overlapping position 4,129,888, it can easily find them all at once without having to search through the entire BAM file.

The following 3 commands are used to:

  1. convert from SAM to BAM format
  2. sort the BAM file
  3. index the sorted BAM file

As you might guess this is computationally intense and as such must be iDEV node or submitted as a job (more on this on Friday). If you want to submit them to the job queue, you will want to separate them with a ";" to ensure that they run sequentially rather than simultaneously as each uses the output of the previous command. Under no circumstances should you run this on the head node.

Do not run on head node

Use hostname to verify you are still on the idev node. (expect to see a computer number NOT a login number in front of 

stampede2.tacc.utexas.edu

If not, and you need help getting a new idev node, see this tutorial.

Commands to be executed in order...
samtools view -b -S -o SRR030257.bam SRR030257.sam
samtools sort SRR030257.bam -o SRR030257.sorted.bam
samtools index SRR030257.sorted.bam
This is a really common sequence of commands, so you might want to add it to your personal cheat sheet if you are keeping one.

It is expected that the first command generate no output, the second command to generate a single line from the bam_sort_core regarding files and memory blocks, and the third line to again generate no output. While the first 2 commands take a minute or two, the third command is very quick.

Examine the output of the previous commands to get an idea of whats going on. Here are some prompts of how to do that:

List the contents of the output directory to see what new files have been created
ls -1  # This is the number 1 not the letter L. Several people seem to get confused about this each year, it is not necessary for any reason other than to give a single column of output regardless of window size.
Expected output
NC_012967.1.fasta
NC_012967.1.fasta.fai
SRR030257.bam
SRR030257.sam
SRR030257.sorted.bam
SRR030257.sorted.bam.bai
 Given what we saw above of fasta index file gaining a .fai ending, can you guess what a *.bai file is?

Sure enough, it's the index file for the BAM file.

You might be tempted to gzip BAM files when copying them from one computer to another. Don't bother! They are already internally compressed, so you won't be able to shrink the file. Further, to the best of my knowledge, no programs accept a gzipped bam file as a format to use.

On the other hand, compressing SAM files will save a fair bit of space, but the conversion between bam back to sam is pretty simple. Making storage of bam files likely a better decision.

Call genome variants

Now we use the mpileup command from samtools to compile information about the bases mapped to each reference position. The output is a BCF file. This is a binary form of the text Variant Call Format (VCF). For more information about VCF files: https://docs.gdc.cancer.gov/Data/File_Formats/VCF_Format/

You should only execute one of these commands
samtools mpileup -u -f NC_012967.1.fasta SRR030257.sorted.bam > SRR030257.bcf
bcftools mpileup -O u -f NC_012967.1.fasta SRR030257.sorted.bam -o bcftools.SRR030257.bcf

While the first command will generate a warning stating that "samtools mpileup option `u` is functional, but deprecated. Please switch to using bcftools mpileup in future." and finish running in ~10 minutes. The corresponding mpileup command which generates nearly identical output, takes >35 minutes to complete. If you are working on this tutorial during class time on Tuesday, you should likely choose the first option. If you have circled back later in the week or outside class time, adjusting to the new command would have value as typically once programers start warning that functionality is "depreciated" it is only a matter of time before it is "no longer supported" and then just flat out "broken". This is one of the reasons why updating to the newest version of a program is not always recommended if the version you are using is working for you (more on Friday).

 What are all the options doing? Try calling samtools mpileup without any options to see if you can figure it out before clicking for the answer
Optionpurpose
-ugenerates uncompressed BCF output
-f NC_012967.1.fasta

reference sequence file that has a corresponding faidx index .fai file

SRR030257.sorted.bamBAM input file to calculate pileups from
> SRR030257.bcf

Direct output to SRR030257.bcf file, rather than printing to the screen

-O ulike -u in the samtools command, generates uncompressed BCF output
-o SRR030257.bcfOutput file SRR030257.bcf


The samtools mpileup command will take a few minutes to run. As practice for a fairly common occurrence when working with the iDEV environment, once the command is running, you could try putting it in the background by pressing control-z and then typing the command bg so that you can do some other things in this terminal window at the same time. Remember, there are still many other processors available on this node for you to do other things! Just remember that if you have something running in the background you need to check in with it to see if it is still running with the ps command or watch the command line every time you execute a new command as you may see information about your background task having finished. Alternatively, you could go back to the beginning of this tutorial and read through some examples of trying to install other versions of samtools and troubleshooting conda installations while you wait for the command to finish. 

The fg command (foreground) is the opposite of the bg (background) command. If you want to return your command to your active prompt so you are notified directly when the command finishes (or errors) simply type 'fg' assuming you only have 1 job running in the background.


Convert genome variants to human readable format

Once the mpileup command is complete, convert the BCF file to a "human-readable" VCF file using a bcftools command.

bcftools call -v -c SRR030257.bcf > SRR030257.vcf

What are these options doing?

 See if you can start with the base command "bcftools" and figure out what each option above is doing on the bcftools command we used above. Click here when you think you know.
Optionpurpose
callspecific subcommand to be executed by bcftools for calling SNP and indels
-v

output potential variant sites only

-cconsensus calling
SRR030257.bcf
input bcf file
> SRR030257.vcf
output as a vcf file

Take a look at the SRR030257.vcf file using less. It has a nice header explaining what the columns mean, including answers to some of your questions from yesterday's presentations. https://docs.gdc.cancer.gov/Data/File_Formats/VCF_Format/ can be used to figure out the columns are and what types of information they provide. Below this are the rows of data describing potential genetic variants.

Analyzing variants detected

VCF format has alternative Allele Frequency tags denoted by AF= Try the following command to see what frequency our variants exist at.

grep AF1 SRR030257.vcf

If you look at the AF1= values you will all the lines are either ~ 0.5, or 1.

You can see that even easier with this handy awk 1 liner
 awk -F";" '{for(i=1;i<=NF;i++){if ($i ~ /AF1/){print $i}}}' SRR030257.vcf

^splits each line into columns based on where the ";"s, then searches through each column, if the "AF1" is found in the column, that column is printed. From the output it is even clearer that frequencies are coming up.

This is a pretty useful awk 1 liner to keep track of for pulling a single column out when you know it will have an identifier but not what column it will occur in; just substitute what you expect to find for "AF1" and whatever you want to use to break the line up for ";" (with \t for tab and , being the 2 most common options for tsv and csv files respectively).
The above result can be piped into sort to sort the lines into increasing order
 awk -F";" '{for(i=1;i<=NF;i++){if ($i ~ /AF1/){print $i}}}' SRR030257.vcf | sort
 Did you see anything when you were running the commands in this tutorial to suggest that it was looking for diploid genomes?

After you ran the bcftools call command you saw: "Note: none of --samples-file, --ploidy or --ploidy-file given, assuming all sites are diploid". Just like on a webpage you can use control/command + F to find specific text in the window. Look for 'diploid' and you should see the line referenced above.

Obviously this suggests a way that you could go back and reanalyze this data introducing one of the recommended flags to the bcftools call command and see how this might effect your analysis. If you choose to do this, I suggest adding descriptive file name between 'SRR030257' and '.vcf' to make the results easier to compare.

 Rather than (or in addition to) going back and changing the analysis to look for variation from our haploid genome, can you come up with a way to create a filtered .vcf file that only has variants with a frequency of 1?

An initial attempt might be something like this:

Note these 2 examples will give the exact same output. The first command gives you extra practice with piping, and may help order your thoughts somewhat.
cat SRR030257.vcf | grep AF1=1 > SRR030257.filtered.vcf


grep AF1=1 SRR030257.vcf > SRR030257.filtered.vcf

This is does give us exactly what we asked for: all the lines that show a variant allele frequency of 1. Unfortunately, we lost all the useful header information at the top of the original  SRR030257.vcf file. 

 Can you come up with a way to include this information possibly by looking at the help information on grep?

From 'grep -h' you can see the that the '-v' inverts the match so it will print everything EXCEPT for lines that match.

2 equivalent answers again
cat SRR030257.vcf | grep -v AF1=0 > SRR030257.filtered.vcf

grep -v AF1=0 SRR030257.vcf > SRR030257.filtered.vcf

Will preserve all lines that don't have 'AF1=0' value on the line and is one way of doing this. If you look closely at the non-filtered file you will see that the frequencies are given as AF1=0.### so by filtering out lines that have 'AF1=0' in them we get rid of all frequencies that are not 1, including say 'AF1=0.99'. 

 How you would change this to variants that have a frequency of at least 90%?

Remember from our Raw Sequencing Data tutorial yesterday that we can group certain characters together by placing them between square brackets [].

2 equivalent answers again
cat SRR030257.vcf | grep -v AF1=0.[0-8] > SRR030257.filtered.vcf

grep -v AF1=0.[0-8] SRR030257.vcf > SRR030257.filtered.vcf

Here we added a decimal point after the 0, and then allowed for a match to any digit between 0 and 8. Thus lines that have AF1=1 would not match, nor would a line with AF1=0.9 .


Return to GVA2021 course page.


Optional Exercises at the end of class or for Wednesday/Thursday choose your own tutorial time.

Calling variants in trimmed reads.

  1. Trim both Read1 and Read2 using info from read preprocessing tutorial.
  2. Map reads with bowtie2 using info from read mapping tutorial.
  3. Call variants using this tutorial.

Remember in the intro tutorial we talked about file/directory naming. Be sure you don't write over your old files. Maybe create a new directories like GVA_samtools_bowtie_improved for the outputs.

Further Optional Exercises

  • Which mapper finds more variants?
  • Can you figure out how to filter the VCF files on various criteria, like coverage, quality, ... ?
  • How many high quality mutations are there in these E. coli samples relative to the reference genome?
  • Look at how the reads supporting these variants were aligned to the reference genome in the Integrative Genomics Viewer (IGV). This will be a separate tutorial for tomorrow.


Return to GVA2021 course page.