Single Nucleotide Variant (SNV) calling Tutorial GVA2023
Overview:
SAMtools is a suite of commands for dealing with databases of mapped reads. You'll be using it quite a bit throughout the course. It includes programs for performing variant calling (mpileup-bcftools). This tutorial expects you have already completed the Mapping tutorial.
Learning Objectives
- Work with a more complex conda installation, and how to troubleshoot it.
- Familiarize yourself with SAMtools.
- Use SAMtools to identify variants in the E. coli genomes we mapped in the previous tutorial.
Installing SAMtools
As we have done with: fastqc, cutadapt, and bowtie2, we want to install samtools and bcftools into a new environment (we'll call this one GVA-SNV). Once again, having access to conda-forge will be required to install the most recent version.
samtools --version bcftools --version
samtools --version output:
samtools 1.17 Using htslib 1.17 Copyright (C) 2023 Genome Research Ltd. # Followed by a bunch of compilation details.
bcftools --version output:
bcftools 1.17 Using htslib 1.17 Copyright (C) 2023 Genome Research Ltd. License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html> This is free software: you are free to change and redistribute it. There is NO WARRANTY, to the extent permitted by law.
If you get something else, such as an error message with bcftools --version:
"bcftools: error while loading shared libraries: libgsl.so.25: cannot open shared object file: No such file or directory"
do not proceed, check your conda install command and if it matches the one listed above, get my attention.
Calling variants in reads mapped by bowtie2
Prepare your directories
Since the $SCRATCH directory on stampede2 is effectively infinite for our purposes, we're going to copy the relevant files from our mapping tutorial into a new directory for this tutorial. This should help you identify what files came from what tutorial if you look back at it in the future. Let's copy over just the read alignment file in the SAM format and the reference genome in FASTA format to a new directory called GVA_samtools_tutorial.
Unexpected output when you try to copy final files from mapping tutorial
If you see messages saying something similar to the following:
cp: cannot stat '/scratch/01821/ded/GVA_bowtie2_mapping/bowtie2/SRR030257.sam': No such file or directory cp: cannot stat '/scratch/01821/ded/GVA_bowtie2_mapping/NC_012967.1.fasta': No such file or directory
It suggests something you either did not yet complete the mapping tutorial, or more likely, you stored these files in a different directory. If you think you completed the mapping tutorial, get my attention and be ready to share your screen and I'll try to help you find your missing files.
When copy commands execute successfully, the expected output is silent (no output at all)
Index the FASTA reference file
Assuming you have the above output for samtools --version and bcftools --version (both 1.17), first, you need to index the reference file. (This isn't the same as indexing it for read mapping. It's indexing it so that SAMtools can quickly jump to a certain base in the reference.)
samtools faidx NC_012967.1.fasta
Take a look at the new *.fai file that was created by this command see if you have any idea what some of the numbers mean.
less NC_012967.1.fasta.fai # can exit with "q"
As you can see, the less command also works perfectly well with files that are not in danger of crashing anything without cluttering your terminal with lines of a file.
Convert mapped reads from SAM to BAM, sort, and index
SAM is a text file, so it is slow to access information about how any given read was mapped. SAMtools and many of the commands that we will run later work on BAM files (essentially GZIP compressed binary forms of the text SAM files). These can be loaded much more quickly. Typically, they also need to be sorted, so that when the program wants to look at all reads overlapping position 4,129,888, it can easily find them all at once without having to search through the entire BAM file.
The following 3 commands are used to:
- convert from SAM to BAM format
- sort the BAM file
- index the sorted BAM file
As you might guess this is computationally intense and as such must be iDEV node or submitted as a job (more on this on Friday). If you want to submit them to the job queue, you will want to separate them with a ";" to ensure that they run sequentially rather than simultaneously as each uses the output of the previous command. Under no circumstances should you run this on the head node.
Do not run on head node
Use hostname to verify you are still on the idev node. expect to see a computer number (NOT a login number) in front of
stampede2.tacc.utexas.edu
If not, and you need help getting a new idev node, see this tutorial.
samtools view --threads 48 -b -S -o SRR030257.bam SRR030257.sam samtools sort --threads 48 SRR030257.bam -o SRR030257.sorted.bam samtools index -@ 48 SRR030257.sorted.bam
Cheat Sheat
It is expected that the first command generate no output, the second command to generate a single line from the bam_sort_core regarding files and memory blocks, and the third line to again generate no output. While the first 2 commands take a few minutes each and the third command is very quick on a single thread, with 48 additional threads all complete very quickly.
Examine the output of the previous commands to get an idea of whats going on. Here are some prompts of how to do that:
NC_012967.1.fasta NC_012967.1.fasta.fai SRR030257.bam SRR030257.sam SRR030257.sorted.bam SRR030257.sorted.bam.bai
You might be tempted to gzip
BAM files when copying them from one computer to another. Don't bother! They are already internally compressed, so you won't be able to shrink the file. Further, to the best of my knowledge, no programs accept a gzipped bam file as a format to use.
On the other hand, compressing SAM files will save some space, but the conversion between bam back to sam is pretty simple/quick. Making storage of bam files likely a better decision.
Call genome variants
Now we use the mpileup
command from samtools
to compile information about the bases mapped to each reference position. The output is a BCF file. This is a binary form of the text Variant Call Format (VCF). For more information about VCF files: https://docs.gdc.cancer.gov/Data/File_Formats/VCF_Format/
bcftools mpileup bcftools mpileup --threads 48 -O u -f NC_012967.1.fasta SRR030257.sorted.bam -o SRR030257.bcf
Historical command (now depreciated) which used to give nearly the same output as the bcftools mpileup command
Last year, in addition to the bcftools mpileup command listed above, a second command a second command using samtools mpileup was listed as an option. This was a very common command structure that you may come across elsewhere (samtools mpileup -u -f NC_012967.1.fasta SRR030257.sorted.bam > SRR030257.bcf
). note that besides using the program samtools instead of bcftools, the only differences are the use of -u
instead of -O u
, and piping the output (>
) to the SRR30257.bcf file instead of naming the file with a -o
flag.
Last year it was noted that if you tried the samtools command, there was a warning stating that:
"samtools mpileup option `u` is functional, but deprecated. Please switch to using bcftools mpileup in future."
Further, I warned that adjusting to the new command would have value as typically once programers start warning that functionality is "depreciated" it is only a matter of time before it is "no longer supported" and then just flat out "broken". Sure enough, in less than a year's time, it is no longer working. This is one of the reasons why updating to the newest version of a program is not always recommended if the version you are using is working for you (more on Friday).
Sending programs to the background.
The samtools mpileup
command will take a few minutes to run even with 48 threads. If you have read through the information about the different options, as practice for a fairly common occurrence when working with the iDEV environment, you could try putting it in the background by pressing control-z
and then typing the command bg
so that you can do some other things in this terminal window at the same time. Remember, there are still many other processors available on this node for you to do other things! Just remember that if you have something running in the background you need to check in with it to see if it is still running with the ps
command or watch the command line every time you execute a new command as you may see information about your background task having finished.
The fg
command (foreground) is the opposite of the bg
(background) command. If you want to return your command to your active prompt so you are notified directly when the command finishes (or errors) simply type 'fg
' assuming you only have 1 job running in the background.
Convert genome variants to human readable format
Once the mpileup command is complete, convert the BCF file to a "human-readable" VCF file using a bcftools command.
bcftools call -v -c SRR030257.bcf > SRR030257.vcf
What are these options doing?
Take a look at the SRR030257.vcf
file using less
. It has a nice header explaining what the columns mean, including answers to some of your questions from yesterday's presentations. https://docs.gdc.cancer.gov/Data/File_Formats/VCF_Format/ can be used to figure out the columns are and what types of information they provide. Below this are the rows of data describing potential genetic variants.
Analyzing variants detected
VCF format has alternative Allele Frequency tags denoted by AF= Try the following command to see what frequency our variants exist at.
grep AF1 SRR030257.vcf
If you look at the AF1= values you will see all the lines are either ~ 0.5, or 1.
awk -F";" '{for(i=1;i<=NF;i++){if ($i ~ /AF1/){print $i}}}' SRR030257.vcf
^splits each line into columns based on where the ";"s, then searches through each column, if the "AF1" is found in the column, that column is printed. From the output it is even clearer that frequencies are coming up.
Cheat Sheat
-F";"
(with -F"\t"
for tab or -F","
being the 2 most common options for tsv and csv files respectively).awk -F";" '{for(i=1;i<=NF;i++){if ($i ~ /AF1/){print $i}}}' SRR030257.vcf | sort
This is does give us exactly what we asked for: all the lines that show a variant allele frequency of 1. Unfortunately, we lost all the useful header information at the top of the original SRR030257.vcf file.
Will preserve all lines that don't have 'AF1=0' value on the line and is one way of doing this. If you look closely at the non-filtered file you will see that the frequencies are given as AF1=0.### so by filtering out lines that have 'AF1=0' in them we get rid of all frequencies that are not 1, including say 'AF1=0.99'.
Return to GVA2023 course page.
Optional Exercises at the end of class or for Wednesday/Thursday choose your own tutorial time.
Calling variants in trimmed reads.
- Trim both Read1 and Read2 using info from read preprocessing tutorial.
- Map reads with bowtie2 using info from read mapping tutorial.
- Call variants using this tutorial.
Remember in the intro tutorial we talked about file/directory naming. Be sure you don't write over your old files. Maybe create a new directories like GVA_samtools_bowtie_improved
for the outputs.
Further Optional Exercises
- If you use additional mapping programs, which mapper finds more variants?
- Can you figure out how to filter the VCF files on various criteria, like coverage, quality, ... ?
- How many high quality mutations are there in these E. coli samples relative to the reference genome?
- Look at how the reads supporting these variants were aligned to the reference genome in the Integrative Genomics Viewer (IGV). This will be a separate tutorial for tomorrow.
Welcome to the University Wiki Service! Please use your IID (yourEID@eid.utexas.edu) when prompted for your email address during login or click here to enter your EID. If you are experiencing any issues loading content on pages, please try these steps to clear your browser cache.