...
SAMtools is a suite of commands for dealing with databases of mapped reads. You'll be using it quite a bit throughout the course. It includes programs for performing variant calling (mpileup-bcftools). This tutorial expects you have already completed the Mapping tutorial.
Learning Objectives
- Familiarize yourself with SAMtools.
- Gain important insight into version control.
- Familiarize yourself with SAMtools.
- Use SAMtools to identify variants in the E. coli genomes we mapped in the previous tutorial.
Calling variants in reads mapped by bowtie2
Right now, we'll be using it to call variants (find mutations) in the re-sequenced E. coli genome from the Mapping tutorial. You will need the output SAM files from that tutorial to continue here. If you wish to start this tutorial without completing the Mapping Tutorial, see the bottom section of this page for information about downloading canned data.
We assume that you are still working in the main directory called GVA_bowtie2_mapping that you created on $SCRATCH
.
Loading SAMtools – a lesson in version control
...
Loading SAMtools – a lesson in version control
One of the most important aspects of science is that it is supposed to be reproducible, and as mentioned in an earlier tutorial, a computer will always do exactly what it is told... the trick is telling it to do what you actually want it to do. As bench scientists, we all know (or will soon learn) that protocols change slightly over time... maybe you have had the nightmare troubleshooting experience of a reliable protocol suddenly giving unreliable results only to find out that an enzyme/reagent/kit you bought from a different vendor because it was cheeper is actually not identical in every way, or maybe you find a kit or reagent that claims better yield yet forces small differences in your protocol. Computational biology is no different in that protocols and programs change slightly over time (usually in the form of version updates). In the "best" case, version improvements add new functionality that do not change old analysis, in the worst of cases in an effort to fix small bugs (thereby increase accuracy by eliminating false positives in the eyes of the developers at least) in a way that makes you unaware that anything has changed other than your final output if you have to repeat your analysis (say because you added new samples to your cohort). Sometimes, programs will change drastically enough that even your old commands stop working. This is both a blessing and a curse. A blessing in that you are astutely aware that something has changed, and you are forced to either fix/update your analysis to the new version (typically gaining an understanding of what was changed), and a curse in that you have to figure out how to fix things even if this means continuing to use an older version.
...
Tip | |||||||
---|---|---|---|---|---|---|---|
| |||||||
A lot of text was just devoted to the $PATH variable and how its manipulated and how to investigate it in a few different ways. This is because PATH variables are fairly common, especially when if start experimenting with multiple different programs that may have similar underlying requirements.
When you have multiple results, the top line is the line that will be used unless you specify the entire path to the command. This can be useful in diagnosing what is going wrong particularly when you have an error message that says a command didn't work, and |
Calling variants in reads mapped by bowtie2
Prepare your directories
Since the $SCRATCH directory on lonestar is effectively infinite for our purposes, we're going to copy the relevant files from our mapping tutorial into a new directory for this tutorial. This should help you identify what files came from what tutorial if you look back at it in the future. Let's copy over just the read alignment file in the SAM format and the reference genome in FASTA format to a new directory called GVA_samtools_tutorial.
Code Block | ||||||
---|---|---|---|---|---|---|
| ||||||
cds mkdir GVA_samtools_tutorial cd GVA_samtools_tutorial cp $SCRATCH/GVA_bowtie2_mapping/bowtie2/SRR030257.sam . cp $SCRATCH/GVA_bowtie2_mapping/NC_012967.1.fasta . |
Index the FASTA reference file
First, you need to index the reference file. (This isn't the same as indexing it for read mapping. It's indexing it so that SAMtools can quickly jump to a certain base in the reference.)
Code Block | ||||
---|---|---|---|---|
| ||||
Warning | ||||
| ||||
If you see messages saying something similar to the following:
|
Take a look at the new *.fai file that was created by this command see if you have any idea what some of the numbers mean.
Code Block | ||||
---|---|---|---|---|
| ||||
less NC_012967.1.fasta.fai # can exit with "q"
|
As you can see, the less command also works perfectly well with files that are not in danger of crashing anything without cluttering your terminal with lines of a file.
Convert mapped reads from SAM to BAM, sort, and index
SAM is a text file, so it is slow to access information about how any given read was mapped. SAMtools and many of the commands that we will run later work on BAM files (essentially GZIP compressed binary forms of the text SAM files). These can be loaded much more quickly. Typically, they also need to be sorted, so that when the program wants to look at all reads overlapping position 4,129,888, it can easily find them all at once without having to search through the entire BAM file.
The following 3 commands are used to:
- convert from SAM to BAM format
- sort the BAM file
- index the sorted BAM file
As you might guess this is computationally intense and as such must be iDEV node or submitted as a job (more on this on Friday). If you want to submit them to the job queue, you will want to separate them with a ";" to ensure that they run sequentially rather than simultaneously as each uses the output of the previous command. Under no circumstances should you run this on the head node.
Warning | ||
---|---|---|
| ||
Use showq -u to verify you are still on the idev node. If not, and you need help getting a new idev node, see this tutorial. |
Code Block | ||||
---|---|---|---|---|
| ||||
samtools view -b -S -o SRR030257.bam SRR030257.sam
samtools sort SRR030257.bam -o SRR030257.sorted.bam
samtools index SRR030257.sorted.bam |
Tip |
---|
This is a really common sequence of commands, so you might want to add it to your personal cheat sheet. |
Examine the output of the previous commands to get an idea of whats going on. Here are some prompts of how to do that:
Code Block | ||||||
---|---|---|---|---|---|---|
| ||||||
ls -1 # This is the number 1 not the letter L. Several people seem to get confused about this each year, it is not necessary for any reason other than to give a single column of output regardless of window size.
|
Code Block | ||
---|---|---|
| ||
NC_012967.1.fasta
NC_012967.1.fasta.fai
SRR030257.bam
SRR030257.sam
SRR030257.sorted.bam
SRR030257.sorted.bam.bai |
Expand | ||
---|---|---|
| ||
Sure enough, it's the index file for the BAM file. |
Tip |
---|
You might be tempted to On the other hand, compressing SAM files will save a fair bit of space, but the conversion between bam back to sam is pretty simple. Making storage of bam files likely a better decision. |
Call genome variants
Now we use the mpileup
command from samtools
to compile information about the bases mapped to each reference position. The output is a BCF file. This is a binary form of the text Variant Call Format (VCF).
Code Block | ||
---|---|---|
| ||
samtools mpileup -u -f NC_012967.1.fasta SRR030257.sorted.bam > SRR030257.bcf
|
...
title | What are all the options doing? Try calling samtools mpileup without any options to see if you can figure it out before clicking for the answer |
---|
...
reference sequence file that has a corresponding faidx index .fai file
...
Once the mpileup command is complete, convert the BCF file to a "human-readable" VCF file using the bcftools command (the which command will tell you where this command is located and examination of that path should tell you how you have access to it).
Code Block | ||
---|---|---|
| ||
which bcftools
bcftools call -v -c SRR030257.bcf > SRR030257.
It suggests something you either did not yet complete the mapping tutorial, or more likely, you stored these files in a different directory. If you think you completed the mapping tutorial, get my attention and be ready to share your screen and I'll try to help you find your missing files. When copy commands execute successfully, the expected output is silent (no output at all) |
Index the FASTA reference file
First, you need to index the reference file. (This isn't the same as indexing it for read mapping. It's indexing it so that SAMtools can quickly jump to a certain base in the reference.)
Code Block | ||||
---|---|---|---|---|
| ||||
samtools faidx NC_012967.1.fasta
|
Take a look at the new *.fai file that was created by this command see if you have any idea what some of the numbers mean.
Code Block | ||||
---|---|---|---|---|
| ||||
less NC_012967.1.fasta.fai # can exit with "q"
|
As you can see, the less command also works perfectly well with files that are not in danger of crashing anything without cluttering your terminal with lines of a file.
Convert mapped reads from SAM to BAM, sort, and index
SAM is a text file, so it is slow to access information about how any given read was mapped. SAMtools and many of the commands that we will run later work on BAM files (essentially GZIP compressed binary forms of the text SAM files). These can be loaded much more quickly. Typically, they also need to be sorted, so that when the program wants to look at all reads overlapping position 4,129,888, it can easily find them all at once without having to search through the entire BAM file.
The following 3 commands are used to:
- convert from SAM to BAM format
- sort the BAM file
- index the sorted BAM file
As you might guess this is computationally intense and as such must be iDEV node or submitted as a job (more on this on Friday). If you want to submit them to the job queue, you will want to separate them with a ";" to ensure that they run sequentially rather than simultaneously as each uses the output of the previous command. Under no circumstances should you run this on the head node.
Warning | ||
---|---|---|
| ||
Use showq -u to verify you are still on the idev node. If not, and you need help getting a new idev node, see this tutorial. |
Code Block | ||||
---|---|---|---|---|
| ||||
samtools view -b -S -o SRR030257.bam SRR030257.sam
samtools sort SRR030257.bam -o SRR030257.sorted.bam
samtools index SRR030257.sorted.bam |
Tip |
---|
This is a really common sequence of commands, so you might want to add it to your personal cheat sheet. |
Examine the output of the previous commands to get an idea of whats going on. Here are some prompts of how to do that:
Code Block | ||||||
---|---|---|---|---|---|---|
| ||||||
ls -1 # This is the number 1 not the letter L. Several people seem to get confused about this each year, it is not necessary for any reason other than to give a single column of output regardless of window size.
|
Code Block | ||
---|---|---|
| ||
NC_012967.1.fasta
NC_012967.1.fasta.fai
SRR030257.bam
SRR030257.sam
SRR030257.sorted.bam
SRR030257.sorted.bam.bai |
Expand | ||
---|---|---|
| ||
Sure enough, it's the index file for the BAM file. |
Tip |
---|
You might be tempted to On the other hand, compressing SAM files will save a fair bit of space, but the conversion between bam back to sam is pretty simple. Making storage of bam files likely a better decision. |
Call genome variants
Now we use the mpileup
command from samtools
to compile information about the bases mapped to each reference position. The output is a BCF file. This is a binary form of the text Variant Call Format (VCF).
Code Block | ||
---|---|---|
| ||
samtools mpileup -u -f NC_012967.1.fasta SRR030257.sorted.bam > SRR030257.bcf
|
Expand | ||||||||||
---|---|---|---|---|---|---|---|---|---|---|
| ||||||||||
|
The samtools mpileup
command will take a few minutes to run. As practice for a fairly common occurrence when working with the iDEV environment, once the command is running, you should try putting it in the background by pressing control-z
and then typing the command bg
so that you can do some other things in this terminal window at the same time. Remember, there are still many other processors available on this node for you to do other things! Just remember that if you have something running in the background you need to check in with it to see if it is still running with the ps
command or watch the command line every time you execute a new command as you may see information about your background task having finished.
Convert genome variants to human readable format
Once the mpileup command is complete, convert the BCF file to a "human-readable" VCF file using the bcftools command (the which command will tell you where this command is located and examination of that path should tell you how you have access to it).
Code Block | ||
---|---|---|
| ||
which bcftools
bcftools call -v -c SRR030257.bcf > SRR030257.vcf
|
Expand | ||
---|---|---|
| ||
which bcftools returns the following line: /opt/apps/intel18/samtools/1.6/bin/bcftools You may recognize from above that it is the same directory samtools is found in when you load samtools with the module system. |
...
Take a look at the SRR030257.vcf
file using less
. It has a nice header explaining what the columns mean, including answers to some of your questions from yesterday's presentations. Below this are the rows of data describing potential genetic variants.
Analyzing variants detected
VCF format has alternative Allele Frequency tags denoted by AF= Try the following command to see what frequency our variants exist at.
...
Will preserve all lines that don't have 'AF1=0' value on the line and is one way of doing this. If you look closely at the non-filtered file you will see that the frequencies are given as AF1=0.### so by filtering out lines that have 'AF1=0' in them we get rid of all frequencies that are not 1, including say 'AF1=0.99'.
Expand | |||||||
---|---|---|---|---|---|---|---|
| |||||||
Remember from our Raw Sequencing Data tutorial yesterday that we can group certain characters together by placing them between square brackets [].
Here we added a decimal point after the 0, and then allowed for a match to any digit between 0 and 8. |
Optional Exercises at the end of class or for Wednesday/Thursday choose your own tutorial time.
Calling variants in reads mapped by BWA or the improved read quality reads
...
Follow the same directions to call variants in the BWA or Bowtie2 mapped reads with the improved quality. Just be sure you don't write over your old files. Maybe create new directories like GVA_samtools_bwa
and GVA_samtools_bowtie_improved
for the output in each case.
...
If you do not have the output from the Mapping tutorial, run the first 4 commands to copy over the output that would have been produced. Then, you can immediately start this tutorial!
Code Block |
---|
cds
mkdir GVA_samtools_tutorial # if directory already exists, don't worry it won't delete the current contents
cd GVA_samtools_tutorial
cp -r $BI/gva_course/mapping/bowtie2 .
# optional commands for optional and more in-depth tutorials
cp -r $BI/gva_course/mapping/bowtie . #note these files were created with bowtie NOT bowtie2 and is a drastically different read mapper despite the similar name
cp -r $BI/gva_course/mapping/bwa .
|
...
say 'AF1=0.99'.
Expand | |||||||
---|---|---|---|---|---|---|---|
| |||||||
Remember from our Raw Sequencing Data tutorial yesterday that we can group certain characters together by placing them between square brackets [].
Here we added a decimal point after the 0, and then allowed for a match to any digit between 0 and 8. |
Optional Exercises at the end of class or for Wednesday/Thursday choose your own tutorial time.
Calling variants in trimmed reads.
- Trim both Read1 and Read2 using info from read preprocessing tutorial.
- Map reads with bowtie2 using info from read mapping tutorial.
- Call variants using this tutorial.
Remember in the intro tutorial we talked about file/direcotory naming. Be sure you don't write over your old files. Maybe create a new directories like GVA_samtools_bowtie_improved
for the outputs.
Further Optional Exercises
- Which mapper finds more variants?
- Can you figure out how to filter the VCF files on various criteria, like coverage, quality, ... ?
- How many high quality mutations are there in these E. coli samples relative to the reference genome?
- Look at how the reads supporting these variants were aligned to the reference genome in the Integrative Genomics Viewer (IGV). This will be a separate tutorial for tomorrow.
Other version version of samtools
As suggested in the initial introduction, the point of this optional tutorial is to work through getting a different version of samtools to work (the command line expectations, flags, and subcommands (ie bcftools call) were not what they are now in version 0.1.18). To make sure you are starting in the right place:
...
Good luck, and remember if you undertake this and get frustrated with it, it is a great learning experience and is by far probably the most difficult thing you will attemptexercise/tutorial in this class. As part of the learning experience, feel free to contact me with any questions or problems you are specifically having with it, but cookbooked suggestions would defeat the intended purpose of beating your head against the problem to figure it out. You DO have the necessary skills to figure out how to do this now.
Expand | ||
---|---|---|
| ||
The 2016 class did this tutorial in the opposite manner where the 0.1.18 version was given as the walkthrough rather than the 1.6 version. I strongly suggest you use that tutorial only as a way to check your answers, and hence the lack of a link to but that tutorial can be found here. |
Return to GVA2019 GVA2020 course page.