Genome Analysis Toolkit (GATK) -- GVA2020


Overview

The Genome Analysis Toolkit (GATK) is a set of programs developed by the Broad Institute and supported by an extensive website. As mentioned in the final presentation, it can perform much of the analysis required for calling genomic variants, as well as many other things. Why, you may ask, did this magical tool only appear on the final day of the class? GATK uses read mappers, read aligners, variant callers, and all the other things (or similar things) that you have been introduced to throughout the course, so we have actually been going over what you needed to know in smaller, more digestible chunks.

This tutorial is quite small and showcases only the smallest drop in the bucket of what GATK is capable of doing. This is because the Broad itself has developed many tutorials for all the different things GATK does, and extensive forums are available if the tutorials are not enough to get you through what you are trying to do. Finally, as the makers of the software, they publish and maintain what they regard as the best way to use their product in the form of 'best practices'. If you are going to use GATK, it's a really good idea to make sure you are following their best practices, because people will raise a big eyebrow if you say you are going against the flow.

While GATK is great, one-stop shops are often not the best at everything they do, so don't be afraid to use other programs, particularly if that is what other researchers in your field are doing.

Objectives

  1. Load GATK on Lonestar
  2. Use the sample data provided by the Broad (through TACC) to verify that GATK is working
  3. Explore a little of what is under the hood.

Tutorial: Loading GATK

While you may think, based on the overview, that GATK is an obvious choice for a module on TACC, you may be surprised to learn that seemingly every other year TACC removes it as a module, and this is a bad year. On the plus side, once we install it for you locally, the only issue will be if you need to update the version, and recent changes to GATK have made it much easier to work with.

Installing GATK and verifying it is functioning
# set up directories
mkdir -p $WORK/src
mkdir -p $HOME/local/bin


# download file and extract it
cd $WORK/src   
wget https://github.com/broadinstitute/gatk/releases/download/4.1.7.0/gatk-4.1.7.0.zip
unzip gatk-4.1.7.0.zip

# copy executables to somewhere already in your $PATH variable (remember we set this up on Monday in your .bashrc file)
cp gatk-4.1.7.0/gatk $HOME/local/bin
cp gatk-4.1.7.0/*.jar $HOME/local/bin

# verify correctly installed
cds # TACC shortcut for: cd $SCRATCH
gatk -help # if this does not output a large list of colored text, try the following command and if that does not output colored text get my attention
gatk --list
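
If the shell reports that gatk cannot be found at all, the usual cause is that $HOME/local/bin is not in your $PATH. The following is a minimal sketch of how you could check this; path_contains is a hypothetical helper written for this page, not part of GATK or TACC.

```shell
# Hypothetical helper: succeed if the given directory is one of the entries in $PATH.
path_contains() {
    case ":$PATH:" in
        *":$1:"*) return 0 ;;
        *) return 1 ;;
    esac
}

# Check the directory the gatk executable was copied into above.
if path_contains "$HOME/local/bin"; then
    echo "$HOME/local/bin is in your PATH"
else
    echo "$HOME/local/bin is missing from your PATH; check your .bashrc"
fi
```

If the directory is missing, revisit the .bashrc setup from Monday, log out and back in, and then rerun the gatk commands above.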

If you see 316 lines of long scrolling output detailing some copyright information and a bunch of different commands, everything is correctly loaded. While individual tools require different options, and the program itself takes many different options, only 3 things are ALWAYS required:

flag      Description
(none)    Tool name: which tool you are trying to use
-R        Reference sequence file
-I        Input bam file

Stealing a nice mnemonic device from a GATK tutorial (which is condensed below), these 3 arguments don't have to be in this order, but if you learn them in this order (Tool, Reference, Input) you will be able to remember them if you TRI. Remember, specific tools will require additional arguments.
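
To make the Tool, Reference, Input order concrete, here is a small sketch that only assembles and prints a command line without running GATK. The function name tri_command and the placeholder names ToolName, reference.fasta, and input.bam are made up for illustration; swap in a real tool and your own files.

```shell
# Hypothetical helper: print a GATK command line in Tool, Reference, Input order.
# It only prints the command; it does not run GATK.
tri_command() {
    tool="$1"
    reference="$2"
    input="$3"
    printf 'gatk %s -R %s -I %s\n' "$tool" "$reference" "$input"
}

# Placeholder names; replace them with a real tool name and your own files.
tri_command ToolName reference.fasta input.bam
```

Printing the command first is a cheap way to check that all three pieces are present before you actually run anything.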

Getting sample data

Rather than using sample data specifically for this tutorial, we will instead do a small tutorial based on our read mapping tutorial from day 2 of the course. Assuming you completed that tutorial, the following should work.

You are trying to copy the SRR030257.sam file from the GVA_bowtie2_mapping/bowtie2 directory and the NC_012967.1.fasta file from the GVA_bowtie2_mapping directory. The commands below use the copies stored under /scratch/01821/ded.
mkdir $SCRATCH/GVA_GATK
cd $SCRATCH/GVA_GATK
cp /scratch/01821/ded/GVA_bowtie2_mapping/bowtie2/SRR030257.sam .
cp /scratch/01821/ded/GVA_bowtie2_mapping/NC_012967.1.fasta .
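
Before moving on, it can be worth confirming that both files actually arrived and are not empty. This is a minimal sketch; check_inputs is a hypothetical helper written for this page, and it only tests that each file exists with a non-zero size, not that its contents are valid.

```shell
# Hypothetical helper: report each listed file as ok or missing/empty.
# Returns non-zero if at least one file is missing or empty.
check_inputs() {
    rc=0
    for f in "$@"; do
        if [ -s "$f" ]; then
            echo "ok: $f"
        else
            echo "missing or empty: $f"
            rc=1
        fi
    done
    return "$rc"
}

# Run from inside $SCRATCH/GVA_GATK after the cp commands above.
check_inputs SRR030257.sam NC_012967.1.fasta || echo "copy the files again before continuing"
```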

Next you need to convert the .sam file to a .bam file.

Refresher on how to convert .sam files into .bam files
samtools view -S -b SRR030257.sam > SRR030257.bam 


# Do you remember what tutorial we used this command in before?


Tutorial: Use GATK to count the number of reads in a bam file

Using the following information, we will use the gatk CountReads tool to count the number of reads in the SRR030257.bam file, which was mapped against the NC_012967.1.fasta reference file. Pay attention to the words in bold and the table/discussion in the previous tutorial section, and see if you can figure out how to do this on your own.

 Check your answer

gatk CountReads -R NC_012967.1.fasta -I SRR030257.bam
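
As a sanity check on the number GATK reports, you can count the alignment records in the original .sam file yourself, since every line that does not start with @ is one record. This is a sketch rather than an official GATK procedure: count_sam_records is a hypothetical helper, and because GATK applies its own default read filters, the two numbers should generally agree but a small difference is possible.

```shell
# Hypothetical helper: count alignment records in a SAM file.
# Header lines start with '@'; every other non-blank line is one record.
count_sam_records() {
    awk '!/^@/ && NF { n++ } END { print n + 0 }' "$1"
}

# Only run the count if the file from this tutorial is in the current directory.
if [ -f SRR030257.sam ]; then
    count_sam_records SRR030257.sam
fi
```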