Genome Analysis Toolkit (GATK) . -- GVA2023


Overview

The Genome Analysis Toolkit (GATK) is a set of programs developed by the broad institute with an extensive website. As mentioned in the final presentation on Friday, it has the ability to perform much of the analysis required for calling genomic variants as well as many many other things. Why you may ask yourself did this magical tool only appear towards the end of class? GATK uses read mappers, read aligners, variant callers, and all the other things (or similar things) that you have been introduced to throughout the course so we have actually been going over what you needed to know in smaller more digestible chunks.

This tutorial is quite small and does not showcase but the smallest drop in a bucket of what GATK is capable of doing. This is because the broad itself has developed many many tutorials for all the different things GATK does and extensive forums are available if the tutorials are not enough to get you through what you are trying to do. Finally, as the makers of the software they have put out and maintain what they regard as the best way to use their product in the form of 'best practices'. If you are going to use GATK, it's a real real real good idea to make sure you are following their best practices because that is a situation where people will raise a big eyebrow if you say you are going against the flow.

While GATK is great, one stop shops often come with default settings/values that may or may not be the best solution for your work. Be sure to understand the options you are using, the options you are not using but could be, and what others doing similar analysis have done in their work. Particularly if your data seems odd or noisy.

Objectives

  1. Install GATK
  2. Explore a little of how GATK works
  3. Provide resources for working with GATK tutorials

idev nodes

Note that the entire tutorial can take place inside an idev session as we will be copying data from earlier tutorials rather than the BioITeam. Not only can it take place on an idev node, it MUST take place on an idev node. While preparing this tutorial, accidentally running the gatk commands on the head node resulted in memory errors.


Tutorial: Installing GATK

Not surprisingly, GATK is available through conda. After some trial and error it seems there are 2 extras needed to get it up and running. 1. conda-forge, and 2. needing to specify that we want the most recent version (this is similar to what you see with spades). I would suggest giving it its own environment perhaps "GVA-gatk"

conda create --name GVA-gatk -c conda-forge -c bioconda gatk4=4.4.0
conda activate GVA-gatk
gatk --version

The first thing you will see is a comment about javvva and the "jar being used. If you are on the headnode, i expect you will get an error message about memory related to java, otherwise your output should be:

The Genome Analysis Toolkit (GATK) v4.4.0.0
HTSJDK Version: 3.0.5
Picard Version: 3.0.0
Verifying gatk is functioning
gatk --list

If you see ~300 lines of a long scrolling output detailing some copyright information and a bunch of different commands everything is correctly loaded. While individual tools will require different options and the program itself takes many different options only 3 things are ALWAYS required:

flagDescription

Tool name, what tool are you trying to use
-RReference sequence file
-IInput bam file

Stealing a nice mnemonic devices from a GATK legacy tutorial (which is condensed below), these 3 arguments don't have to be in this order, but if you learn them in this order, you will be able to remember them if you TRI. Remember, specific tools will require additional arguments.

Getting sample data

Rather than using sample data specifically for this tutorial, we will instead do a small tutorial based on our read mapping tutorial  and samtools SNV tutorial from day 2 of the course. Assuming you completed that tutorial you the following tutorial should work.

You are trying to copy the SRR030257.bam file and NC_012967.1.fasta and its index file from the $SCRATCH/GVA_samtools_tutorial directory
mkdir $SCRATCH/GVA_GATK
cd $SCRATCH/GVA_GATK
cp ../GVA_samtools_tutorial/SRR030257.bam .
cp ../GVA_samtools_tutorial/NC_012967.1.fasta* .


Tutorial: Use GATK to count the number of reads in a bam file


Using the following information we will use gatk the CountReads tool to count the number of reads in the SRR030257.bam file which was mapped against the NC_012967.fasta reference file. Pay attention to the the words in bold and the table/discussion in the previous tutorial section and see if you can figure out how to do this on your own.

check your answer here
gatk CountReads -R NC_012967.1.fasta -I SRR030257.bam

The following lines tell us this is not working as intended

***********************************************************************


A USER ERROR has occurred: Fasta dict file file:///scratch/01821/ded/GVA_GATK/NC_012967.1.dict for reference file:///scratch/01821/ded/GVA_GATK/NC_012967.1.fasta does not exist. Please see http://gatkforums.broadinstitute.org/discussion/1601/how-can-i-prepare-a-fasta-file-to-use-as-reference for help creating it.


***********************************************************************

While the provided link surprisingly doesn't work, it is clear that it is failing because it is not finding a .dict file for our reference sequence. Quickly googling "gatk create dict file" takes us to https://gatk.broadinstitute.org/hc/en-us/articles/360036729911-CreateSequenceDictionary-Picard-

And the command you can build is
gatk CreateSequenceDictionary -R NC_012967.1.fasta -O NC_012967.1.dict

Using the ls command you can see that the dictionary has been created, and the CountReads command will now run correctly. If we retry our CountReads command we get very different output (note that yours will have subtle differences in things like the names of directories):

22:39:44.804 INFO  CountReads - ------------------------------------------------------------
22:39:44.806 INFO  CountReads - The Genome Analysis Toolkit (GATK) v4.2.6.1
22:39:44.807 INFO  CountReads - For support and documentation go to https://software.broadinstitute.org/gatk/
22:39:44.818 INFO  CountReads - Executing as ded@c455-093.stampede2.tacc.utexas.edu on Linux v3.10.0-1160.53.1.el7.x86_64 amd64
22:39:44.819 INFO  CountReads - Java runtime: OpenJDK 64-Bit Server VM v11.0.13+7-b1751.21
22:39:44.820 INFO  CountReads - Start Date/Time: June 22, 2022 at 10:39:43 PM CDT
22:39:44.821 INFO  CountReads - ------------------------------------------------------------
22:39:44.821 INFO  CountReads - ------------------------------------------------------------
22:39:44.825 INFO  CountReads - HTSJDK Version: 2.24.1
22:39:44.826 INFO  CountReads - Picard Version: 2.27.1
22:39:44.827 INFO  CountReads - Built for Spark Version: 2.4.5
22:39:44.828 INFO  CountReads - HTSJDK Defaults.COMPRESSION_LEVEL : 2
22:39:44.828 INFO  CountReads - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
22:39:44.829 INFO  CountReads - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
22:39:44.830 INFO  CountReads - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
22:39:44.831 INFO  CountReads - Deflater: IntelDeflater
22:39:44.831 INFO  CountReads - Inflater: IntelInflater
22:39:44.832 INFO  CountReads - GCS max retries/reopens: 20
22:39:44.833 INFO  CountReads - Requester pays: disabled
22:39:44.834 INFO  CountReads - Initializing engine
22:39:45.536 INFO  CountReads - Done initializing engine
22:39:45.537 INFO  ProgressMeter - Starting traversal
22:39:45.539 INFO  ProgressMeter -        Current Locus  Elapsed Minutes       Reads Processed     Reads/Minute
22:40:32.457 WARN  IntelInflater - Zero Bytes Written : 0
22:40:32.466 INFO  CountReads - 7600360 read(s) filtered by: WellformedReadFilter 

22:40:32.471 INFO  ProgressMeter -             unmapped              0.8                     0              0.0
22:40:32.473 INFO  ProgressMeter - Traversal complete. Processed 0 total reads in 0.8 minutes.
22:40:32.473 INFO  CountReads - CountReads counted 0 total reads
22:40:32.474 INFO  CountReads - Shutting down engine
[June 22, 2022 at 10:40:32 PM CDT] org.broadinstitute.hellbender.tools.CountReads done. Elapsed time: 0.81 minutes.
Runtime.totalMemory()=1581252608
Tool returned:
0


What in all that are we actually looking for you might ask? 

22:40:32.473 INFO  CountReads - CountReads counted 0 total reads

But we know this isn't true.

You can check this against the original fastq files SRR030257_1.fastq and SRR030257_1.fastq (check them in your GVA_bowtie2_mapping directory)
grep -c "^+$" $SCRATCH/GVA_bowtie2_mapping/SRR030257_1.fastq
grep -c "^+$" $SCRATCH/GVA_bowtie2_mapping/SRR030257_2.fastq

This tells us that there 7600360 total reads which is certainly greater than 0. While we haven't discussed sam format in much detail, each read gets its own line, and if you compare the .sam file and the original fastq file listed above, you see that each line on the sam file seems to start with 'SRR030257' followed by a number related to the read. This gives us a base handle to check with grep

help checking sam file
grep -c "^SRR030257" SRR030257.sam

In the case of the sam file, you can see that the 7600360 exactly matches what we saw from our fastq files, while it is possible there is an error with the sam to bam conversion the file does have a very large 352 M if there are no reads in that file. 


So where did our reads actually go in our GATK analysis? A few lines above we see this line:

 22:40:32.466 INFO  CountReads - 7600360 read(s) filtered by: WellformedReadFilter 

This tells us that all the reads were removed by a specific filter set. Accessing the CountReads help information, we see we can actually disable the WellformedReadFilter with the -DF or --disable-read-filter. giving us an updated CountReads command of: 

gatk CountReads -R NC_012967.1.fasta -I SRR030257.bam  -DF WellformedReadFilter

Now we see a final line of output being returned by the tool of: 7,600,360 which is what we expected!

GATK has a similar tool 'CountBases' which as the name implies, counts the bases in a sam/bam file.

gatk CountBases -R NC_012967.1.fasta -I SRR030257.bam  -DF WellformedReadFilter

This tells us that there are 273,612,960 total bases in the file, something we have not previously calculated before.

As mentioned this is a very small introduction to GATK adapted from one of the broad's tutorials which can be found here. There are numerous other tutorial links for on the GATK website you can begin to work with now that you have verified your installation and begun troubleshooting some of the quirks of the program.

Return to GVA2023 home page.