
...

Here is their general workflow:

(workflow diagram from: http://www.broadinstitute.org/gatk/guide/best-practices)

Objectives:

  1. Provide a fairly detailed understanding of a common human genome variant analysis program.
  2. Provide a working analysis pipeline and sample data to evaluate said pipeline.

...

  1. The first few steps are "Non-GATK" - meaning they are just what we have already been doing in the course.  You may "map to reference" using either BWA or bowtie2, but you will find BWA used more often with GATK.  BE SURE TO MAP WITH READ GROUPS.

  2. The Mark Duplicates step serves the purpose of pre-identifying PCR duplicates in your data so that the genotyper doesn't get skewed by potential amplification artifacts when weighing support for a variant.  If your library is reasonably complex and not vastly over-sampled, this should only knock out about 10-20% of your raw data.  If it takes out a higher percentage of your data but you still have significant coverage (e.g. >30x at minimum, or >50x on average) then you should still be fine.  If you have many samples, you can get away with lower per-sample coverage since "Variant Discovery" is done jointly across the entire sample set.

  3. Indel realignment - turns out that insertions & deletions can lead to artifacts like SNPs being called at their edges just because the mapper can't quite figure out where the in/del should be relative to a SNP on a per-read basis.  There are two steps to this - "RealignerTargetCreator" that figures out where realignment should be done and then "IndelRealigner" that does the re-alignment.  It's a two-step process because you can use external information (e.g. the indel catalog provided by the 1000 genomes project) to more sensitively detect possible indel sites.

  4. BaseRecalibrator - re-scores the per-base quality scores to be more accurate.  Raw quality scores in your fastq files come from aggregate measures of images, sequencing-by-synthesis intensities, spatial variation on the flow cell, etc.  Re-calibration takes the mapped raw data, checks how good it is against the given reference AND a list of common SNPs (like dbSNP for human), builds a new error model with this "training set" and provides a "recalibration table" that subsequent GATK tools can use to tweak base quality values.

  5. RR (ReduceReads) compression is optional - it just saves disk space by intelligently throwing away data that is not required downstream.  We have plenty of space at TACC, so I recommend not using it.

  6. Joint Variant Calling can be done by one of two tools: HaplotypeCaller or UnifiedGenotyper.  HaplotypeCaller is the go-to tool for human genomes & exomes.  UnifiedGenotyper is more tweak-able - better for tumor studies (e.g. mixed populations) and for samples with ploidy > 2.   Here is a great blog entry by Craig DeLoughery comparing the two.

  7. Variant Recalibration is the same song, different verse, as Base Recalibration: it re-scores variant confidence values using known sites as a training set, just as BQSR re-scores base qualities.

  8. GATK has an extensive set of tools for filtering and functional annotation.  SPHS likes many other ways of doing this better (like grep/awk/sed/sort/join for filtering, and annovar for annotation), but if you're sticking to human-only analysis then the GATK tools and suggested workflows are probably a good idea.
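Since step 1 insists on read groups, here is a minimal sketch of what a BWA mapping command with an embedded read group might look like.  The reference, fastq, sample, and library names are placeholders (not course data), and the block only assembles and prints the command string rather than assuming bwa is installed:

```shell
# Placeholder inputs - substitute your own reference and fastq file names
ref=hg19.fasta
sample=NA12878

# bwa mem's -R option embeds a read group line in the output SAM.
# ID, SM (sample), LB (library), and PL (platform) are the fields
# GATK/Picard expect to find.
rg="@RG\tID:${sample}.lane1\tSM:${sample}\tLB:lib1\tPL:ILLUMINA"

cmd="bwa mem -R '${rg}' ${ref} ${sample}_R1.fastq ${sample}_R2.fastq"
echo "${cmd}"
```

The RG ID just has to be unique per lane/run; SM is what GATK uses to group reads by sample during joint calling.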

...

  1. The documentation appears extensive, but it's written for experts - you have to know a lot already to make use of it.
  2. Load the module first: "module load gatk" (this defines $TACC_GATK_DIR).
  3. There is only one command, "GenomeAnalysisTK.jar", which as you can see is a Java (64-bit) jar file, so to run anything the syntax is:

    Code Block
    java -Xmx4g -jar $TACC_GATK_DIR/GenomeAnalysisTK.jar -T <GATK module you want to run> <other options>

    Implications of this, and other notes:

    1. Watch out for Java memory requirements: -Xmx4g has always worked for me.  Someday it might not.
    2. Tools usually have "inherited options" - options shared across many tools.  These are not documented on each tool's page but on the general GATK command line page.  This is a neat idea, but makes it hard to track down details for a specific command.
    3. Worse than having split documentation, options aren't always consistent.  For instance, the IndelRealigner tool uses "-known" (one dash) while RealignerTargetCreator uses "--known" (two dashes), even though you run them back-to-back.
    4. Multithreading options are really tricky - they may or may not be supported on any given tool.
    5. Broad is constantly working on the software.  Things like the names of options, parameter values, etc. may change over time (though it is increasingly stable) so watch out if you upgrade.
    6. Unlike most programs, which fail very quickly (i.e. within seconds) if you give them bad input, GATK takes a while to fail - it can take a minute or two just to get a tool going to the point where it fails, and the error will be buried in a lot of other verbose output.
    7. Pay attention to parameters and be willing to invest significant time tuning these to your specific application.  The GATK documentation specifically says to do this - don't use it blindly (like any complex tool).
  4. Almost all the tools are deathly slow - pretty much any step done in GATK will take a long time.  For instance, when I run human exomes at 40 million 2x100 read-pairs per sample, mapping takes only a few hours but the rest of the GATK pipeline can take another 24 hours on a single sample.
  5. There are two ways to "run" GATK tools: a) command line mode, one tool at a time, and b) pipeline mode, which assumes you are running the LSF job scheduler and consequently isn't used at TACC (Lonestar uses SGE and Stampede uses Slurm).
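To put the run-time and coverage numbers above in perspective, here is a quick back-of-the-envelope calculation using the 40 million 2x100 read-pairs-per-sample figure.  The genome and exome target sizes are round-number assumptions, not course values:

```shell
pairs=40000000                 # 40 million read-pairs per sample
bases=$(( pairs * 2 * 100 ))   # 2 reads/pair x 100 bp/read = 8 billion bases

genome=3000000000              # ~3 Gb human genome
exome=60000000                 # ~60 Mb exome target (assumed size)

echo "genome-wide: $(( bases / genome ))x"   # 2x - far too low for whole-genome calling
echo "exome target: $(( bases / exome ))x"   # 133x raw, before dedup removes 10-20%
```

This is why exome samples at this depth genotype well even after duplicates are marked, while the same data volume spread over a whole genome would not.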

...

  1. The best option is to use the ones in $BI if they are there (in $BI/ref_database/GATK) (NOTE: You can join the BioITeam and put them there!!  Free storage this way!).  Otherwise make your own local copy from those provided by GATK.   TACC does not maintain these for the community (yet). 

    Expand
    titleTips on downloading and setting up GATK bundles.  These are very advanced commands, listed here only as a reference.

    To pull GATK resource bundles and check them:

    ftp ftp.broadinstitute.org

    username: gsapubftp-anonymous
    <cd to the bundle you want>
    prompt    # turn off per-file confirmation prompts
    bin       # binary transfer mode
    mget *    # fetch everything in the directory

    Then cd into the downloaded directory and strip the leading directory paths from the .md5 files (the bundled checksum files embed Broad's own paths, so md5sum -c can't find your local copies otherwise):

    for file in *.md5; do sed -i 's/\(\/\).*\(\/\)//' $file; done

    Then run the md5 check:

    for file in *.md5; do md5sum -c $file; done
  2.  If you are working with a non-model organism, you are on your own to make the resource bundle.  For instance, if you have a fasta file of your reference then you'll need the picard tool CreateSequenceDictionary.jar to make a sequence dictionary of it for GATK.  Use "module load picard" and then "ls $TACC_PICARD_DIR" to see all the picard tools at TACC.
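For that non-model-organism case, reference preparation boils down to two commands: picard's CreateSequenceDictionary for the .dict file and samtools faidx for the .fai index (GATK wants both sitting next to the fasta).  This sketch just assembles the command strings - the fasta name is a placeholder, and $TACC_PICARD_DIR assumes you've already done "module load picard":

```shell
ref=myref.fasta   # placeholder reference fasta name

# GATK needs a sequence dictionary (.dict) next to the fasta...
dict_cmd="java -jar \$TACC_PICARD_DIR/CreateSequenceDictionary.jar R=${ref} O=${ref%.fasta}.dict"

# ...and a samtools fasta index (.fai)
fai_cmd="samtools faidx ${ref}"

echo "${dict_cmd}"
echo "${fai_cmd}"
```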

...

Expand
titleHere's an abbreviated SAM header with properly defined read groups
Code Block
titleSAM header for ONLY chr20, but showing valid RG tags for GATK/Picard
@HD	VN:1.0	SO:coordinate
@SQ	SN:chr20	LN:63025520
@PG	ID:bowtie2	PN:bowtie2	VN:2.1.0
@RG	ID:NA12878_R1.fq.bowtie.sorted	PL:ILLUMINA	SM:NA12878	LB:L78
@RG	ID:NA12891_R1.fq.bowtie.sorted	PL:ILLUMINA	SM:NA12891	LB:L91
@RG	ID:NA12892_R1.fq.bowtie.sorted	PL:ILLUMINA	SM:NA12892	LB:L92
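A quick way to check whether your own BAM's header looks like the one above is to count its @RG lines - on a real BAM that's "samtools view -H your.bam | grep -c '^@RG'".  So the grep is self-contained here, this demo recreates an abbreviated version of the header with printf instead:

```shell
# Recreate an abbreviated, tab-separated version of the header shown above
printf '@HD\tVN:1.0\tSO:coordinate\n'                          >  header.sam
printf '@SQ\tSN:chr20\tLN:63025520\n'                          >> header.sam
printf '@RG\tID:NA12878_R1\tPL:ILLUMINA\tSM:NA12878\tLB:L78\n' >> header.sam
printf '@RG\tID:NA12891_R1\tPL:ILLUMINA\tSM:NA12891\tLB:L91\n' >> header.sam
printf '@RG\tID:NA12892_R1\tPL:ILLUMINA\tSM:NA12892\tLB:L92\n' >> header.sam

# Count the read group lines - one per sample in this trio
grep -c '^@RG' header.sam   # prints 3
```

If that count is 0 on your BAM, GATK and Picard will refuse to run - go back and re-map with read groups.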

 

A working (on Lonestar) GATK pipeline script

Expand
titleClick here to see the pipeline. Remember that you'll need to copy resources to $SCRATCH and update file paths on Stampede.

This short script will run the many GATK commands on a SINGLE bam file - that's not best practice if you have more than one sample, but it's easy to modify it to do the first few steps separately, then join your BAMs (samtools merge) and run the last few steps.

WARNING: GATK is slow - running this script on only 12 million reads (6 million read-pairs) takes about 2 hours on Lonestar!

Code Block
titleGATK pipeline (also at $BI/scripts)
#!/bin/bash
# Usage: <this script> <coordinate-sorted BAM with read groups>
bamfile=$1


# Start base recalibrator
java -Xmx4g -jar /opt/apps/gatk/2.8.1/GenomeAnalysisTK.jar  -T BaseRecalibrator -I $bamfile -R /scratch/01057/sphsmith/hg19/chr20.fasta -knownSites /corral-repl/utexas/BioITeam/ref_database/GATK/2.5/hg19/dbsnp_137.hg19.vcf -o $bamfile.recal_data.table &> $bamfile.recal_data.table.log &


# Start picard's mark duplicates
java -Xmx2g -jar /opt/apps/picard/1.107/MarkDuplicates.jar INPUT=$bamfile OUTPUT=$bamfile.dedup.bam METRICS_FILE=$bamfile.dedup.bam.metrics ASSUME_SORTED=TRUE &


# Wait for mark duplicates and base recalibrator to finish - they are independent so can run in parallel
wait


# Index the new de-duplicated bam file for GATK!
samtools index $bamfile.dedup.bam


# Now create the targets for realignment
java -Xmx4g -jar /opt/apps/gatk/2.8.1/GenomeAnalysisTK.jar  -T RealignerTargetCreator -BQSR $bamfile.recal_data.table -I $bamfile.dedup.bam -R /scratch/01057/sphsmith/hg19/chr20.fasta --known /corral-repl/utexas/BioITeam/ref_database/GATK/2.5/hg19/1000G_phase1.indels.hg19.vcf -o $bamfile.dedup.bam.intervals &> $bamfile.dedup.realigner.log 


# Now do the realignments
java -Xmx4g -jar /opt/apps/gatk/2.8.1/GenomeAnalysisTK.jar  -T IndelRealigner -BQSR $bamfile.recal_data.table -I $bamfile.dedup.bam -R /scratch/01057/sphsmith/hg19/chr20.fasta -known /corral-repl/utexas/BioITeam/ref_database/GATK/2.5/hg19/1000G_phase1.indels.hg19.vcf -targetIntervals $bamfile.dedup.bam.intervals -o $bamfile.dedup.realigned.bam &> $bamfile.dedup.indelrealigner.log 


# And finally call the genotypes
java -Xmx4g -jar /opt/apps/gatk/2.8.1/GenomeAnalysisTK.jar -T HaplotypeCaller -R /scratch/01057/sphsmith/hg19/chr20.fasta -I $bamfile.dedup.realigned.bam --dbsnp /corral-repl/utexas/BioITeam/ref_database/GATK/2.5/hg19/dbsnp_137.hg19.vcf -BQSR $bamfile.recal_data.table -o $bamfile.dedup.realigned.bam.vcf &> $bamfile.sorted.dedup.realigned.bam.vcf.log 
 


 

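Once the pipeline above has produced a VCF, the grep/awk style of filtering mentioned earlier can be as simple as a QUAL cutoff.  The three variant records below are fabricated purely for illustration:

```shell
# Build a tiny tab-separated VCF (made-up records, just to demo the filter)
printf '##fileformat=VCFv4.1\n'                          >  toy.vcf
printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n' >> toy.vcf
printf 'chr20\t60522\t.\tT\tC\t71.3\t.\tDP=14\n'         >> toy.vcf
printf 'chr20\t61098\t.\tC\tT\t12.6\t.\tDP=3\n'          >> toy.vcf
printf 'chr20\t62731\t.\tA\tG\t228.0\t.\tDP=35\n'        >> toy.vcf

# Keep header lines plus any record whose QUAL (column 6) is >= 30
awk -F'\t' '/^#/ || $6 >= 30' toy.vcf > toy.filtered.vcf

# Count surviving variant records (non-header lines)
grep -vc '^#' toy.filtered.vcf   # prints 2: the QUAL 12.6 record is gone
```

The same one-liner pattern works on real GATK output; just point it at $bamfile.dedup.realigned.bam.vcf and pick a cutoff appropriate to your data.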
Compare samtools to GATK on the chromosome 20 exome data.

 

Diversions

 

These are oddities:

 

...