Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

Table of Contents

...

Expand
titleExpand here for detailed descriptions of the troubleshooting that took place for this last year.


Info

The assumption last year was that the correct command would be: conda install -c bioconda samtools as it is what was listed at https://anaconda.org/bioconda/samtools. Instead the correct command ended up being: conda install -c bioconda samtools bcftools openssl=1.0 

There are 2 different things going on in this command.

  1. Forcing the installation of a specific version of openssl. In this case, a lower version than would normally be installed if samtools were installed by itself. According to https://github.com/bioconda/bioconda-recipes/issues/12100 my understanding is that when the conda package was put together there is an error wherein samtools specifically says to get version 1.1 of openssl, but the samtools program specifically requires version 1.0 to be present.
  2. We are installing both samtools and bcftools at the same time. This can clean up some installation problems when there are conflicts between individual packages and you want to use them in a single environment. An alternative would be to have a samtools environment and a bcftools environment, but that creates unnecessary steps of having to change environments in the middle of your analysis.
Expand
titleClick here to expand and see what the outcome of the assumed installation command is, what the problem is, and steps you could take to fix it.


Warning
titleThis box contains example commands and outputs showing you something that does NOT work for educational and diagnostic purposes. If you use the code listed in this box, be sure you use ALL the code or you may run into downstream problems with this tutorial.


Code Block
languagebash
conda install -c bioconda samtools
samtools --version

The above command appeared to install correctly as other conda installations did, but the second command which you would expect to show the version of samtools instead returns the following error:

No Format
samtools: error while loading shared libraries: libcrypto.so.1.0.0: cannot open shared object file: No such file or directory

Googling the entire error the top results clearly mentioned conda and several pages listed problems associated "fixes" with different conda installation commands:

  1. One of the suggested fixes was to add access to the conda-forge channel (as we have done this year).
  2. Additionally, last year the entire course was taught with the hope of sequentially adding new programs to a single growing environment. As mentioned yesterday, such an approach is not always optimal/easy, and hence why this year we are creating a number of additional environments. Working from the assumption that you wanted to keep a single environment, one fix that appeared to be working well based on community feedback was conda install -c bioconda samtools=1.9 --force-reinstall, but at the expense of altering existing packages in the environment. When running the command, the number of programs that would be downgraded was nearly 2 full screens long.
  3. Rather than jumping to the "force-reinstall" solution it was suggested to copy the existing conda environment to a new "test" environment (conda create --name GVA2021-samtools-test --clone GVA2021). Once in the new environment, using the "force-reinstall" command above would have given access to samtools, but would then also require testing other programs in the environment (such as bowtie2, cutadapt, fastqc). Assuming all programs still (seemed) to work you could then rename the environment (conda create --name GVA2021-V2 --clone GVA2021-samtools-test; conda env remove --name GVA2021-samtools-test). Obviously the more programs that are added, the more likely running into these kinds of conflicts (where different versions of the same dependency are required), and the more programs you would have to check to see if the "test" envvironment broke any previously installed programs. 
  4. As there was information that suggested the issues were specific to version 1.12 (including: https://github.com/bioconda/bioconda-recipes/issues/13958), another solution was simply to install an older version of samtools deliberately from the start. In order to do this, I first had to remove the existing (incorrect) samtools version (conda remove samtools) and then specify the older version we wanted to use (conda install -c bioconda samtools==1.11). This then allowed the samtools --version command to give expected output of version 1.11 Unfortunately, during testing it quickly became obvious that the next program to install (bcftools) was going to create an entirely new set of installation problems meaning that samtools again hat to be uninstalled and samtools bcftools and the downgraded openssl version installed together. conda remove samtools ; conda install -c bioconda samtools bcftools openssl=1.0. 





...

No Format
samtools 1.15.1
Using htslib 1.15.1
Copyright (C) 2022 Genome Research Ltd.
# Followed by a bunch of compilation details.


bcftools --version output:

...

Warning
titleUnexpected output when you try to copy final files from mapping tutorial

If you see messages saying something similar to the following:

No Format
cp: cannot stat '/scratch/01821/ded/GVA_bowtie2_mapping/bowtie2/SRR030257.sam': No such file or directory
cp: cannot stat '/scratch/01821/ded/GVA_bowtie2_mapping/NC_012967.1.fasta': No such file or directory

It suggests something you either did not yet complete the mapping tutorial, or more likely, you stored these files in a different directory. If you think you completed the mapping tutorial, get my attention and be ready to share your screen and I'll try to help you find your missing files.

When copy commands execute successfully, the expected output is silent (no output at all)

...

Assuming you have the above output for samtools --version and bcftools --version (both 1.915.1), first, you need to index the reference file. (This isn't the same as indexing it for read mapping. It's indexing it so that SAMtools can quickly jump to a certain base in the reference.)

...

Warning
titleDo not run on head node

Use hostname to verify you are still on the idev node. ( expect to see a computer number (NOT a login number) in front of 

stampede2.tacc.utexas.edu

If not, and you need help getting a new idev node, see this tutorial.


Code Block
languagebash
titleCommands to be executed in order...
samtools view --threads 64 -b -S -o SRR030257.bam SRR030257.sam
samtools sort --threads 64 SRR030257.bam -o SRR030257.sorted.bam
samtools index -@ 64 SRR030257.sorted.bam


Expand
Tip
This is a really common sequence of commands, so you might want to add it to your personal cheat sheet if you are keeping one.

It is expected that the first command generate no output, the second command to generate a single line from the bam_sort_core regarding files and memory blocks, and the third line to again generate no output. While the first 2 commands take a minute or two, the third command is very quick.

Examine the output of the previous commands to get an idea of whats going on. Here are some prompts of how to do that:

Code Block
languagebash
titleList the contents of the output directory to see what new files have been created
collapsetrue
ls -1  # This is the number 1 not the letter L. Several people seem to get confused about this each year, it is not necessary for any reason other than to give a single column of output regardless of window size.
Code Block
titleExpected output
NC_012967.1.fasta
NC_012967.1.fasta.fai
SRR030257.bam
SRR030257.sam
SRR030257.sorted.bam
SRR030257.sorted.bam.bai
Expand
titleGiven what we saw above of fasta index file gaining a .fai ending, can you guess what a *.bai file is?

Sure enough, it's the index file for the BAM file.

Tip

You might be tempted to gzip BAM files when copying them from one computer to another. Don't bother! They are already internally compressed, so you won't be able to shrink the file. Further, to the best of my knowledge, no programs accept a gzipped bam file as a format to use.

On the other hand, compressing SAM files will save a fair bit of space, but the conversion between bam back to sam is pretty simple. Making storage of bam files likely a better decision.

Call genome variants

Now we use the mpileup command from samtools to compile information about the bases mapped to each reference position. The output is a BCF file. This is a binary form of the text Variant Call Format (VCF). For more information about VCF files: https://docs.gdc.cancer.gov/Data/File_Formats/VCF_Format/

Code Block
languagebash
titleYou should only execute *one* of these commands
samtools mpileup -u -f NC_012967.1.fasta SRR030257.sorted.bam > SRR030257.bcf
bcftools mpileup -O u -f NC_012967.1.fasta SRR030257.sorted.bam -o bcftools.SRR030257.bcf

...

titleview sort and index are all subcommands to the samtools program, and --help can be invoked to get more information about each of the subcommands. Can you figure out what the purpose of each of the above options is?


subcommandflagvaluepurpose
view/sort--threads64use 64 additional threads
view-bhas no valueis a toggle to output in bam format
view-Shas no valuewas a toggle to declare the input format as sam. help now tells you that this is depreciated as the program auto detects input format
view-oSRR030257.bamwrite the output of the command to the SRR030257.bam file
view
SRR030257.samunflagged option/keyword. in this case, the top line of the help output lists:
Usage: samtools view [options] <in.bam>|<in.sam>|<in.cram> [region ...]

the <>|<>|<> section states that you can give a bam, sam, or cram file as input

sort
SRR030257.bamunflagged option/keyword. in this case, the top line of the help output lists:

Usage: samtools sort [options...] [in.bam]

the in.bam section states that the input must be in bam format.
Note that the input to the sort subcommand is the output of the view command.

sort-oSRR030257.sorted.bamwrite output to file SRR030257.sorted.bam
index-@64use 64 additional threads. note that in both the previous help outputs, they listed -@, --threads because both are recognized as the same flag. in the first 2 cases, we used the --threads but could have used -@ but for the index subcommand, only -@ was allowed.
index
SRR030257.sorted.bam

unflagged option/keyword. in this case, the top line of the help output lists:
Usage: samtools index [-bc] [-m INT] <in.bam> [out.index]

in this case we are again using the previous commands output file as an input file. 
Note that we have not listed an output file here. Keep that in mind as you read further on.


Info
titleWhy was they -S option included in the view command above if its use is depreciated?

This is included to highlight 2 things:

  1. sometimes you find or are given commands that work, but you dont necessarily understand why they work.
  2. sometimes program version updates change options in ways that dont require you to update your commands



Tip
titleCheat Sheat
This "view" "sort" "index" series is a really common sequence of commands, so you might want to add it to your personal cheat sheet if you are keeping one.

It is expected that the first command generate no output, the second command to generate a single line from the bam_sort_core regarding files and memory blocks, and the third line to again generate no output. While the first 2 commands take a few minutes each and the third command is very quick on a single thread, with 64 additional threads all complete very quickly.

Examine the output of the previous commands to get an idea of whats going on. Here are some prompts of how to do that:

Code Block
languagebash
titleList the contents of the output directory to see what new files have been created
collapsetrue
ls -1  # This is the number 1 not the letter L. Several people seem to get confused about this each year, it is not necessary for any reason other than to give a single column of output regardless of window size.


Code Block
titleExpected output
NC_012967.1.fasta
NC_012967.1.fasta.fai
SRR030257.bam
SRR030257.sam
SRR030257.sorted.bam
SRR030257.sorted.bam.bai


Expand
titleGiven what we saw above of fasta index file gaining a .fai ending, can you guess what a *.bai file is?

Sure enough, it's the index file for the BAM file. Since we did not specify and output file in our index command, it was assumed we wanted to simply append ".bai" corresponding to its new type of a BAI-format index of our input bam file.


Tip

You might be tempted to gzip BAM files when copying them from one computer to another. Don't bother! They are already internally compressed, so you won't be able to shrink the file. Further, to the best of my knowledge, no programs accept a gzipped bam file as a format to use.

On the other hand, compressing SAM files will save some space, but the conversion between bam back to sam is pretty simple/quick. Making storage of bam files likely a better decision.

Call genome variants

Now we use the mpileup command from samtools to compile information about the bases mapped to each reference position. The output is a BCF file. This is a binary form of the text Variant Call Format (VCF). For more information about VCF files: https://docs.gdc.cancer.gov/Data/File_Formats/VCF_Format/

Code Block
languagebash
titleYou should execute this command as is without trying to determine the options for yourself as even with 64 threads it will take several minutes to complete. During the completion time you can review the different options used.
bcftools mpileup
bcftools mpileup -O u -f NC_012967.1.fasta SRR030257.sorted.bam -o SRR030257.bcf


Expand
titleUsing the information from the bcftools mpileup command without any options, can you figure out what are all the options doing?


Optionpurpose

--threads 64

use 64 additional threads
-f NC_012967.1.fasta

reference sequence file that has a corresponding faidx index .fai file

SRR030257.sorted.bamBAM input file to calculate pileups from
-O ugenerates uncompressed BCF output
-o SRR030257.bcfOutput file SRR030257.bcf



Info
titleHistorical command (now depreciated) which used to give nearly the same output as the bcftools mpileup command

Last year, in addition to the bcftools mpileup command listed above, a second command a second command using samtools mpileup was listed as an option. This was a very common command structure that you may come across elsewhere  (samtools mpileup -u -f NC_012967.1.fasta SRR030257.sorted.bam > SRR030257.bcf). note that besides using the program samtools instead of bcftools, the only differences are the use of -u instead of -O u, and piping the output (>) to the SRR30257.bcf file instead of naming the file with a -o flag.

Last year it was noted that if you tried the samtools command, there was a  warning stating that:

"samtools mpileup option `u` is functional, but deprecated. Please switch to using bcftools mpileup in future."

Further, I warned that adjusting to the new command would have value as typically once programers start warning that functionality is "depreciated" it is only a matter of time before it is "no longer supported" and then just flat out "

...

broken". Sure enough, in less than a year's time, and in less than 4 version updates, it is no longer working. This is one of the reasons why updating to the newest version of a program is not always recommended if the version you are using is working for you (more on Friday

...

titleWhat are all the options doing? Try calling samtools mpileup without any options to see if you can figure it out before clicking for the answer

...

reference sequence file that has a corresponding faidx index .fai file

...

Direct output to SRR030257.bcf file, rather than printing to the screen

...

).


Tip
titleSending programs to the background.

The samtools mpileup command will take a few minutes to run even with 64 threads. If you have read through the information about the different options, as practice for a fairly common occurrence when working with the iDEV environment,

...

you could try putting it in the background by

...

pressing control-z and then typing the command bg so that you can do some other things in this terminal window at the same time. Remember, there are still many other processors available on this node for you to do other things! Just remember that if you have something running in the background you need to check in with it to see if it is still running with the ps command or watch the command line every time you execute a new command as you may see information about your background task having finished.

...

Info

The fg command (foreground) is the opposite of the bg (background) command. If you want to return your command to your active prompt so you are notified directly when the command finishes (or errors) simply type 'fg' assuming you only have 1 job running in the background.

Convert genome variants to human readable format

Once the mpileup command is complete, convert the BCF file to a "human-readable" VCF file using a bcftools command.

Code Block
languagebash
bcftools call -v -c SRR030257.bcf > SRR030257.vcf

What are these options doing?

the background.

Expand
titleSee if you can start with the base command "bcftools" and figure out what each option above is doing on the bcftools command we used above. Click here when you think you know.
Optionpurpose
callspecific subcommand to be executed by bcftools for calling SNP and indels
-v

output potential variant sites only

-cconsensus calling
SRR030257.bcf
input bcf file
> SRR030257.vcf
output as a vcf file


Convert genome variants to human readable format

Once the mpileup command is complete, convert the BCF file to a "human-readable" VCF file using a bcftools command.

Code Block
languagebash
bcftools call -v -c SRR030257.bcf > SRR030257.vcf

What are these options doing?

Expand
titleSee if you can start with the base command "bcftools" and figure out what each option above is doing on the bcftools command we used above. Click here when you think you know.


Optionpurpose
callspecific subcommand to be executed by bcftools for calling SNP and indels
-v

output potential variant sites only

-cconsensus calling
SRR030257.bcf
input bcf file
> SRR030257.vcf
output as a vcf file

If you are especially observant, you might notice that in the bcftools mpileup options, the output type option (-O) had an option v which lists that the output file would have been generated as a uncompressed vcf file. While this may seem like you could have used -O v instead of -O u and skipped a step, note that the mpileup subcommand lacked options for only outputting variant sties or using consensus calling.

Take a look at the SRR030257.vcf file using less. It has a nice header explaining what the columns mean, including answers to some of your questions from yesterday's presentations. https://docs.gdc.cancer.gov/Data/File_Formats/VCF_Format/ can be used to figure out the columns are and what types of information they provide. Below this are the rows of data describing potential genetic variants.

...

If you look at the AF1= values you will see all the lines are either ~ 0.5, or 1.

...

^splits each line into columns based on where the ";"s, then searches through each column, if the "AF1" is found in the column, that column is printed. From the output it is even clearer that frequencies are coming up.

Tip
titleCheat Sheat
This is a pretty useful awk 1 liner to keep track of for pulling a single column out when you know it will have an identifier but not what column it will occur in; just substitute what you expect to find for "AF1" and substitiute whatever character you want to use to break the line up for -F";" (with -F"\t" for tab and or -F"," being the 2 most common options for tsv and csv files respectively).

...

Expand
titleHow you would change this to variants that have a frequency of at least 90%?

Remember from our Raw Sequencing Data tutorial yesterday that we can group certain characters together by placing them between square brackets [].

Code Block
languagebash
title2 equivalent answers again
cat SRR030257.vcf | grep -v AF1=0.[0-8] > SRR030257.filtered.vcf

grep -v AF1=0.[0-8] SRR030257.vcf > SRR030257.filtered.vcf

Here we added a decimal point after the 0, and then allowed for a match to any digit between 0 and 8. Thus lines that have AF1=1 would not match, nor would a line with AF1=0.9 . You might make a note to think back on this after tomorrow's presentation covering when we should believe variants are real.


Return to GVA2021 GVA2022 course page.


Optional Exercises at the end of class or for Wednesday/Thursday choose your own tutorial time.

...

  1. Trim both Read1 and Read2 using info from read preprocessing tutorial.
  2. Map reads with bowtie2 using info from read mapping tutorial.
  3. Call variants using this tutorial.

...

Further Optional Exercises

  • Which If you use additional mapping programs, which mapper finds more variants?
  • Can you figure out how to filter the VCF files on various criteria, like coverage, quality, ... ?
  • How many high quality mutations are there in these E. coli samples relative to the reference genome?
  • Look at how the reads supporting these variants were aligned to the reference genome in the Integrative Genomics Viewer (IGV). This will be a separate tutorial for tomorrow.


Return to GVA2021 GVA2022 course page.