Exome Capture Metrics GVA2022

Data used in this tutorial

It is recommended, but not required, that you first complete the trios tutorial and use the data you generated there for this tutorial. Alternatively, canned data is provided.

Overview

As we discussed during some of our variant calling, the majority of reads, and of bases within reads, will match the reference genome perfectly (unless you are using a reference genome that needs some improvement, in which case you should work toward that end, possibly with the spades or novel DNA tutorials as command line jumping-off points). Because so many of the reads just tell you what you already know, you can design your experiments to target reads to a given location or set of locations in the genome via sequence capture. One of the most common targets is the exome, or a portion of it, because mutations in exons are more likely to have a functional consequence than mutations in introns or intergenic regions.

As we mentioned in our discussions on experimental design, it is always better to start from the best data possible by conducting a well-planned experiment. When working with exome (or other targeted sequencing) approaches, part of that is making sure that the enrichment you are doing is successful, and that you know how much sequencing you need to get good coverage of what you are looking at. There are several different ways to measure sequence capture, depending on what you are doing. You might care more about minimizing off-target capture, to make your sequencing dollars go as far as possible. Or you might care more about maximizing on-target capture, to make sure you get data from every region of interest. These two goals usually pull against each other: a capture tuned for one tends to do worse on the other.
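To make the "how much sequencing do I need" question concrete, here is a back-of-the-envelope sketch. It assumes the amount of sequencing is roughly target size times desired mean depth divided by the fraction of bases that land on target. The function name required_bases and all three input numbers are made up for illustration and do not come from this tutorial's data. The estimate ignores duplicates, unmapped reads, and uneven coverage, so treat it as a lower bound.

```shell
# Rough estimate of the total number of sequenced bases needed.
#   $1 = total target size in bp
#   $2 = desired mean depth over the targets
#   $3 = expected on-target fraction (between 0 and 1)
required_bases() {
    awk -v t="$1" -v d="$2" -v f="$3" 'BEGIN { printf "%.0f\n", t * d / f }'
}

# Hypothetical numbers: 1.5 Mb of targets, 50x mean depth, 60% of bases on target.
required_bases 1500000 50 0.6
```

Lowering the on-target fraction in the last argument shows how quickly off-target capture inflates the sequencing you have to pay for.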

Learning objectives

  1. Install picard to a conda environment.
  2. Use picard's CollectHsMetrics function to determine what fraction of reads were associated with the exome of chromosome 20.

Picard

Picard is another tool, like gatk, that is put out by the Broad Institute. As of gatk version 4.0, gatk bundles picard with all of its distributions. This means that if you have already done the gatk tutorial, you already have access to picard! Alternatively, because picard is just a subset of the gatk package, you may prefer to install only the picard tools rather than the full suite. Further, conda offers a "picard-slim" package that includes most of the picard tools, but not those few that would otherwise require an associated R program.

Installing Picard

Here I will present two different methods of installing and accessing picard and its associated tools. The most basic guidance I have is that I don't see much downside to installing the full gatk package, unless you know you only need a limited number of the tools in picard and don't want to 'clutter' up your environment.

  1. If you wish to install picard as part of the gatk package (allows you to access commands via gatk toolname)

    As was done in our gatk tutorial, the commands for putting gatk in its own environment and activating it are below, followed by the output of the version check. The gatk tutorial contains slightly more information about this installation if you are interested.
    conda create --name GVA-gatk -c bioconda gatk4
    conda activate GVA-gatk
    gatk --version
    The Genome Analysis Toolkit (GATK) v4.2.6.1
    HTSJDK Version: 2.24.1
    Picard Version: 2.27.1



  2. If you wish to install picard as a standalone package (allows you to access commands via picard toolname). Since the only reason that jumps out at me to do this is to not "clutter" things, we'll go with the "slim" package, though changing to the full picard package would not be difficult.

    conda create --name GVA-picard -c bioconda picard-slim
    conda activate GVA-picard
    picard --version

    The version command above will actually print the picard help text (a list of all picard tools). While conda was downloading the package, it reported version 2.27.3, but I don't see any way to get picard's installed version directly from the command line, though an individual tool's version can be accessed via picard ToolName --version. The listing of all of picard's tools does tell us that picard was installed successfully, though.
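One indirect way to recover the version is to ask conda which package version it installed into the environment. The sketch below assumes the environment is named GVA-picard and the package is picard-slim, as in the commands above. The helper name pkg_version is made up for illustration. Note that this reports the version of the conda package, not a version printed by picard itself.

```shell
# Print the version column for an exact package name, reading `conda list` output on stdin.
pkg_version() {
    awk -v pkg="$1" '$1 == pkg { print $2 }'
}

# Only query conda if it is available on this machine.
if command -v conda >/dev/null 2>&1; then
    conda list -n GVA-picard | pkg_version picard-slim
fi
```

If you installed the gatk package instead, the same helper should work with the GVA-gatk environment and the gatk4 package name.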

Get some data

Here is a link to the full picard documentation and here is a link to the CollectHsMetrics tool

To run CollectHsMetrics, there are three prerequisites: 1) a BAM file, 2) a list of the genomic intervals that were to be captured, and 3) the reference (.fa). As you would guess, the BAM and the interval list both have to be based on exactly the same genomic reference file.
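A quick sanity check on the "same reference" requirement is to compare the contig names and lengths recorded in the two files. The sketch below assumes that samtools is installed, that the interval list is in Picard's interval_list format (which carries a SAM-style header), and that the interval file is called targets.interval_list, a hypothetical name. The helper sq_fields is also made up for illustration. Matching names and lengths does not prove that the two references are identical, but a mismatch is a clear sign that something is wrong.

```shell
# Print "name<TAB>length" for every @SQ line of SAM-style header text on stdin.
sq_fields() {
    awk -F'\t' '$1 == "@SQ" {
        sn = ""; ln = ""
        for (i = 2; i <= NF; i++) {
            if ($i ~ /^SN:/) sn = substr($i, 4)
            if ($i ~ /^LN:/) ln = substr($i, 4)
        }
        print sn "\t" ln
    }' | sort
}

BAM=/corral-repl/utexas/BioITeam/ngs_course/human_variation/NA12878.chrom20.ILLUMINA.bwa.CEU.exome.20111114.bam
INTERVALS=targets.interval_list   # hypothetical file name; substitute your own

# Only run the comparison when samtools and both input files are available.
if command -v samtools >/dev/null 2>&1 && [ -r "$BAM" ] && [ -r "$INTERVALS" ]; then
    bam_sq=$(samtools view -H "$BAM" | sq_fields)
    int_sq=$(sq_fields < "$INTERVALS")
    if [ "$bam_sq" = "$int_sq" ]; then
        echo "contig names and lengths match"
    else
        echo "contig names or lengths differ; check which reference each file was built from"
    fi
fi
```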

For our tutorial, the BAM file is one of these:

BAM files for exome capture evaluation tutorial
/corral-repl/utexas/BioITeam/ngs_course/human_variation/NA12878.chrom20.ILLUMINA.bwa.CEU.exome.20111114.bam  
/corral-repl/utexas/BioITeam/ngs_course/human_variation/NA12892.chrom20.ILLUMINA.bwa.CEU.exome.20111114.bam
/corral-repl/utexas/BioITeam/ngs_course/human_variation/NA12891.chrom20.ILLUMINA.bwa.CEU.exome.20111114.bam

I've started with one of Illumina's target capture definitions (the vendor of your capture kit will provide this), but since the bam files only contain chr20 data, I've also created a target definitions file restricted to chr20. Here they are: