An introduction to GATK (GVA14)

GATK: A much more sophisticated option for variant calling

The Genome Analysis Toolkit (GATK) has been developed and continually refined at the Broad Institute of MIT and Harvard.  It is the "gold standard" for human genotyping in most instances (tumor biology being a possible exception).  Here is the original publication on the toolset.

Always read the "best practices" documents for new releases - things do change!  The best (most stable) place to find this documentation is here on their Version History page UNDER THE "Guide Book (PDF)" tab.

Here is their general workflow (diagram from: http://www.broadinstitute.org/gatk/guide/best-practices):

Pipeline explanation:

  1. The first few steps are "Non-GATK" - meaning they are just what we have already been doing in the course.  You may "map to reference" using either BWA or bowtie2, but you will find BWA used more often with GATK.  BE SURE TO MAP WITH READ GROUPS.
  2. The Mark Duplicates step serves the purpose of pre-identifying PCR duplicates in your data so that the genotyper doesn't get skewed by potential amplification artifacts when weighing support for a variant.  If your library is reasonably complex and not vastly over-sampled, this should only knock out about 10-20% of your raw data.  If it takes out a higher percentage of your data but you still have significant coverage (e.g. >30x at minimum, or >50x on average) then you should still be fine.  If you have many samples, you can get away with lower per-sample coverage since "Variant Discovery" is done jointly across the entire sample set.  (See the flagstat sketch just after this list for a quick way to check your duplicate fraction.)
  3. Indel realignment - turns out that insertions & deletions can lead to artifacts like SNPs being called at their edges just because the mapper can't quite figure out where the in/del should be relative to a SNP on a per-read basis.  There are two steps to this - "RealignerTargetCreator" that figures out where realignment should be done and then "IndelRealigner" that does the re-alignment.  It's a two-step process because you can use external information (e.g. the indel catalog provided by the 1000 genomes project) to more sensitively detect possible indel sites.
  4. BaseRecalibrator - re-scores the per-base quality scores to be more accurate.  Raw quality scores in your fastq files come from aggregate measures of images, sequencing-by-synthesis intensities, spatial variation on the flow cell, etc.  Re-calibration takes the mapped raw data, checks how good it is against the given reference AND a list of common SNPs (like dbSNP for human), builds a new error model with this "training set" and provides a "recalibration table" that subsequent GATK tools can use to tweak base quality values.
  5. Reduced Reads (RR) compression is optional - it just saves disk space by intelligently throwing away data that is not required downstream.  We have plenty of space at TACC, so I recommend not using it.
  6. Joint Variant Calling can be done by one of two tools: HaplotypeCaller or UnifiedGenotyper.  HaplotypeCaller is the go-to tool for human genomes & exomes.  UnifiedGenotyper is more tweak-able - better for tumor studies (e.g. mixed populations) and if you may have ploidy > 2 (see the ploidy sketch just after this list).  Here is a great blog entry by Craig DeLoughery comparing the two.
  7. Variant Recalibration is the same song, different verse, as Base Recalibration: a model is built from your calls that overlap trusted resources (e.g. HapMap, dbSNP) and is then used to re-score the confidence of all your raw variant calls.
  8. GATK has an extensive set of tools for filtering and functional annotation.  SPHS likes many other ways of doing this better (like grep/awk/sed/sort/join for filtering, and annovar for annotation), but if you're sticking to human-only analysis then the GATK tools and suggested workflows are probably a good idea.
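For the duplicate fraction mentioned in step 2, you can check the output of Mark Duplicates with samtools.  A minimal sketch ("yourfile.dedup.bam" is a hypothetical name matching the output of the pipeline script at the bottom of this page):

samtools flagstat yourfile.dedup.bam    # the "duplicates" line reports how many reads were flagged as PCR duplicates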

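And if you do need ploidy > 2 (step 6), UnifiedGenotyper exposes it directly.  A minimal sketch (file names are hypothetical; --sample_ploidy and -glm are the relevant UnifiedGenotyper options in GATK 2.x):

java -Xmx4g -jar $TACC_GATK_DIR/GenomeAnalysisTK.jar -T UnifiedGenotyper \
    -R ref.fasta -I yourfile.dedup.realigned.bam \
    --sample_ploidy 4 -glm BOTH \
    -o yourfile.ploidy4.vcf
# --sample_ploidy sets the assumed ploidy per sample; -glm BOTH calls SNPs and indels together
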
Here is a lovely extensive write-up by Mel Melendrez explaining the GATK pipeline steps with nice examples showing the utility of these many steps. 

Notes about running GATK:

  1. The documentation appears extensive, but it's written for experts - you have to know a lot to make use of it.
  2. There is only one command, "GenomeAnalysisTK.jar", which, as you can see, is a 64-bit Java jar file, so to run anything the syntax is:

    java -Xmx4g -jar $TACC_GATK_DIR/GenomeAnalysisTK.jar -T <GATK module you want to run> <other options>

    Implications of this, and other notes:

    1. Watch out for Java memory requirements: -Xmx4g has always worked for me.  Someday it might not.
    2. Tools usually have "inherited options".  These are options that are used across many tools.  They are not documented on each tool's page but on the GATK command line page.  This is a neat idea, but it makes it hard to track down details for a specific command (see the sketch after these notes for one way to list a tool's full option set).
    3. Worse than having split documentation, options aren't always consistent.  For instance, the IndelRealigner tool uses "-known" (one dash) while RealignerTargetCreator uses "--known" (two dashes), even though you run them back-to-back.
    4. Multithreading options are really tricky - they may or may not be supported on any given tool (the engine flags are -nt for data threads and -nct for CPU threads per data thread, but many tools accept only one or neither).
    5. Broad is constantly working on the software.  Things like the names of options, parameter values, etc. may change over time (though it is increasingly stable) so watch out if you upgrade.
    6. Unlike most programs which fail very quickly (i.e. seconds) if you give them bad input, GATK takes a while to fail - it can take a minute or two just to get a tool going to the point that it fails, and the error will be buried in a lot of other verbose output.
    7. Pay attention to parameters and be willing to invest significant time tuning these to your specific application.  The GATK documentation specifically says to do this - don't use it blindly (like any complex tool).
  3. Almost all the tools are deathly slow - pretty much any step done in GATK will take a long time.  For instance, when I run human exomes at 40 million 2x100 read-pairs per sample, mapping takes only a few hours but the rest of the GATK pipeline can take another 24 hours on a single sample.
  4. There are two ways to "run" GATK tools: (a) command-line mode, one tool at a time, and (b) pipeline mode, which assumes you are running the LSF job scheduling system and consequently isn't used at TACC (Lonestar uses SGE and Stampede uses Slurm).
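Since the documentation is split between the tool pages and the engine page, it is often quickest to ask the jar itself.  A minimal sketch (assuming the TACC gatk module defines $TACC_GATK_DIR, as used above):

module load gatk
# engine-level ("inherited") options:
java -Xmx4g -jar $TACC_GATK_DIR/GenomeAnalysisTK.jar --help
# the same, plus the tool-specific options for one tool:
java -Xmx4g -jar $TACC_GATK_DIR/GenomeAnalysisTK.jar -T HaplotypeCaller --help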

Prerequisites for running GATK

Resource bundles: reference genomes, SNP databases, etc. are required for GATK to work.  Here are some options & ideas:

  1. The best option is to use the ones in $BI if they are there (in $BI/ref_database/GATK) (NOTE: You can join the BioITeam and put them there!!  Free storage this way!).  Otherwise make your own local copy from those provided by GATK.   TACC does not maintain these for the community (yet). 

     Tips on downloading and setting up GATK bundles:

    To pull GATK resource bundles and check them, open an anonymous FTP session:

    ftp ftp.broadinstitute.org
    username: gsapubftp-anonymous     (the password is blank - just press return)
    <find bundle you want>
    prompt                            (turn off per-file confirmation)
    bin                               (binary transfer mode)
    mget *                            (fetch everything in that directory)

    Then cd into the download directory and strip the remote paths out of the .md5 files so they refer to the local file names:

    for file in *.md5; do sed -i 's/\(\/\).*\(\/\)//' "$file"; done    # deletes everything from the first / to the last / (the directory path)

    Then run the md5 check:

    for file in *.md5; do md5sum -c "$file"; done
  2.  If you are working in a non-model system, you are on your own to make the resource bundle.  For instance, if you have a fasta file of your reference then you'll need to use the picard tool CreateSequenceDictionary.jar to make a sequence dictionary of it for GATK (see the sketch below).  Use "module load picard" and then "ls $TACC_PICARD_DIR" to see all the picard tools at TACC.
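     A minimal sketch of preparing a custom reference for GATK ("myref.fasta" is a hypothetical file name; GATK expects the .dict and .fai files to sit next to the fasta):

    module load picard
    # make the sequence dictionary (myref.dict) that GATK/Picard need
    java -Xmx2g -jar $TACC_PICARD_DIR/CreateSequenceDictionary.jar R=myref.fasta O=myref.dict
    # make the fasta index (myref.fasta.fai), also required by GATK
    samtools faidx myref.fasta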

Read groups: All your .bam files MUST have read groups defined in their headers (@RG lines) AND each read must be tagged (with an RG:Z: field) as belonging to one of those read groups.

  1. This is usually easily done during mapping:
    1. the "-r" option with bwa sampe
    2. the -R option with bwa mem
    3. the "- rg-id" AND "-rg" options with bowtie2
      but you MUST tell the mapper to do it in all cases!
  2. If you didn't do it at mapping, you can add read groups afterwards with samtools merge (its -r option attaches read groups based on file names), with the picard tool AddOrReplaceReadGroups, or with your own perl/python/bash commands (see the examples just after the SAM header below).
  3. Picard's MarkDuplicates requires the "LB" field to specify the sample's library (usually sample = library)
  4. GATK requires the ID, PL, and SM fields.
 Here's an abbreviated SAM header with properly defined read groups
SAM header for ONLY chr20, but showing valid RG tags for GATK/Picard
@HD	VN:1.0	SO:coordinate
@SQ	SN:chr20	LN:63025520
@PG	ID:bowtie2	PN:bowtie2	VN:2.1.0
@RG	ID:NA12878_R1.fq.bowtie.sorted	PL:ILLUMINA	SM:NA12878	LB:L78
@RG	ID:NA12891_R1.fq.bowtie.sorted	PL:ILLUMINA	SM:NA12891	LB:L91
@RG	ID:NA12892_R1.fq.bowtie.sorted	PL:ILLUMINA	SM:NA12892	LB:L92
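
Two minimal sketches for getting read groups like these into your BAMs (the file names, read-group IDs, and the RGPU/lane value are hypothetical, chosen to mirror the header above):

# at mapping time with bwa mem: -R takes the whole @RG line, with tabs written as \t
bwa mem -R '@RG\tID:NA12878.run1\tSM:NA12878\tPL:ILLUMINA\tLB:L78' hg19.fa NA12878_R1.fq NA12878_R2.fq > NA12878.sam

# after the fact with the picard AddOrReplaceReadGroups tool
java -Xmx2g -jar $TACC_PICARD_DIR/AddOrReplaceReadGroups.jar \
    INPUT=NA12878.sorted.bam OUTPUT=NA12878.rg.bam \
    RGID=NA12878.run1 RGSM=NA12878 RGPL=ILLUMINA RGLB=L78 RGPU=lane1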


A working (on Lonestar) GATK pipeline script

 The pipeline script is shown below (and is also at $BI/scripts).  Remember that you'll need to copy resources to $SCRATCH and update file paths on Stampede.

This short script will run all the many GATK commands on a SINGLE bam file - that's not best practices if you have more than one sample, but it's easy to modify it to do the first few steps separately, then join your BAMs (samtools merge) and run the last few steps (see the sketch after the script).

WARNING: GATK is slow - running this script on only 12 million reads (6 million read-pairs) takes about 2 hours on Lonestar!

GATK pipeline (also at $BI/scripts)
#!/bin/bash
bamfile=$1


# Start base recalibrator
java -Xmx4g -jar /opt/apps/gatk/2.8.1/GenomeAnalysisTK.jar  -T BaseRecalibrator -I $bamfile -R /scratch/01057/sphsmith/hg19/chr20.fasta -knownSites /corral-repl/utexas/BioITeam/ref_database/GATK/2.5/hg19/dbsnp_137.hg19.vcf -o $bamfile.recal_data.table &> $bamfile.recal_data.table.log &


# Start picard's mark duplicates
java -Xmx2g -jar /opt/apps/picard/1.107/MarkDuplicates.jar INPUT=$bamfile OUTPUT=$bamfile.dedup.bam METRICS_FILE=$bamfile.dedup.bam.metrics ASSUME_SORTED=TRUE &


# Wait for mark duplicates and base recalibrator to finish - they are independent so can run in parallel
wait


# Index the new de-duplicated bam file for GATK!
samtools index $bamfile.dedup.bam


# Now create the targets for realignment
java -Xmx4g -jar /opt/apps/gatk/2.8.1/GenomeAnalysisTK.jar  -T RealignerTargetCreator -BQSR $bamfile.recal_data.table -I $bamfile.dedup.bam -R /scratch/01057/sphsmith/hg19/chr20.fasta --known /corral-repl/utexas/BioITeam/ref_database/GATK/2.5/hg19/1000G_phase1.indels.hg19.vcf -o $bamfile.dedup.bam.intervals &> $bamfile.dedup.realigner.log 


# Now do the realignments
java -Xmx4g -jar /opt/apps/gatk/2.8.1/GenomeAnalysisTK.jar  -T IndelRealigner -BQSR $bamfile.recal_data.table -I $bamfile.dedup.bam -R /scratch/01057/sphsmith/hg19/chr20.fasta -known /corral-repl/utexas/BioITeam/ref_database/GATK/2.5/hg19/1000G_phase1.indels.hg19.vcf -targetIntervals $bamfile.dedup.bam.intervals -o $bamfile.dedup.realigned.bam &> $bamfile.dedup.indelrealigner.log 


# And finally call the genotypes
java -Xmx4g -jar /opt/apps/gatk/2.8.1/GenomeAnalysisTK.jar -T HaplotypeCaller -R /scratch/01057/sphsmith/hg19/chr20.fasta -I $bamfile.dedup.realigned.bam --dbsnp /corral-repl/utexas/BioITeam/ref_database/GATK/2.5/hg19/dbsnp_137.hg19.vcf -BQSR $bamfile.recal_data.table -o $bamfile.dedup.realigned.bam.vcf &> $bamfile.sorted.dedup.realigned.bam.vcf.log 
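
If you have several samples, here is a minimal sketch of the merge-then-joint-call variation mentioned above (sample names are hypothetical and mirror the read-group example earlier; -BQSR is omitted here because each sample carries its own recalibration table):

# run dedup / recalibration / realignment per sample as in the script above, then:
samtools merge all.dedup.realigned.bam NA12878.dedup.realigned.bam NA12891.dedup.realigned.bam NA12892.dedup.realigned.bam
samtools index all.dedup.realigned.bam

# joint genotyping across all samples in the merged BAM (each input must carry its own SM read-group tag)
java -Xmx4g -jar /opt/apps/gatk/2.8.1/GenomeAnalysisTK.jar -T HaplotypeCaller \
    -R /scratch/01057/sphsmith/hg19/chr20.fasta \
    -I all.dedup.realigned.bam \
    --dbsnp /corral-repl/utexas/BioITeam/ref_database/GATK/2.5/hg19/dbsnp_137.hg19.vcf \
    -o all.dedup.realigned.vcf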
 


 

Exercise: compare the samtools variant calls to the GATK calls on the chr20 exome data.
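
One way to start (a sketch; "gatk.vcf" and "samtools.vcf" are hypothetical names for your two call sets):

# tally sites called by both, by GATK only, and by samtools only
grep -v '^#' gatk.vcf     | awk '{print $1"_"$2}' | sort > gatk.pos
grep -v '^#' samtools.vcf | awk '{print $1"_"$2}' | sort > samtools.pos
comm -12 gatk.pos samtools.pos | wc -l    # sites called by both
comm -23 gatk.pos samtools.pos | wc -l    # GATK-only sites
comm -13 gatk.pos samtools.pos | wc -l    # samtools-only sites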

 

Diversions

 

These are some oddities to explore.  For example, this one-liner joins two simplified VCFs on their first column and prints the rows where the corresponding fields differ between the two samples:

 

join NA12892.raw.vcf.simple NA12891.raw.vcf.simple | awk '{if ($3!=$6 || $4!=$7) {print}}'