Data used in this tutorial

It is recommended, but not required, that you first complete the trios tutorial and use the data you generated in it. Alternatively, canned data is provided.


As we discussed during some of our variant calling, the majority of reads, and of bases within reads, will match the reference genome perfectly. (The exception is a reference genome that needs some improvement, in which case you should work toward that end, possibly using the SPAdes or novel DNA tutorials as command-line starting points.) Because so many of the reads only tell you what you already know, you can design your experiments to target reads to a given location or set of locations in the genome via sequence capture. One of the most common targets is the exome, or a portion of the exome, because mutations in exons are more likely to have a functional consequence than mutations in introns or intergenic regions.

As we mentioned in our discussions on experimental design, it is always better to start from the best data possible by conducting a well-planned experiment. When working with exome (or other targeted) sequencing, part of that is making sure that your enrichment was successful and that you know how much sequencing you need for good coverage of your regions of interest.

Evaluating capture metrics

There are many ways to measure sequence capture, depending on what you are doing. You might care more about minimizing off-target capture, to make your sequencing dollars go as far as possible. Or you might care more about capturing every target, to make sure you get data from every region of interest. These two goals usually trade off against each other.


Using Picard's "CollectHsMetrics" function to evaluate capture

See the full Picard documentation, and in particular the entry for the CollectHsMetrics tool.

To run CollectHsMetrics on Lonestar, there are three prerequisites: 1) a BAM file, 2) a list of the genomic intervals that were to be captured, and 3) the reference (.fa). As you would guess, the BAM file and the interval list both have to be based on exactly the same genomic reference file.
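One way to check that the BAM file and the interval list share a reference is to compare the @SQ lines of their headers. The following is a minimal sketch, not part of the original tutorial: it assumes samtools is on your PATH, that the interval list carries a SAM-style header (as the "withhead" file names suggest), and `sq_pairs`, `bam.sq.txt`, and `intervals.sq.txt` are hypothetical names chosen for illustration.

```shell
# Print "name<TAB>length" for every @SQ line of a SAM-style header read on stdin.
sq_pairs() {
    awk -F'\t' '/^@SQ/ {
        sn = ""; ln = ""
        for (i = 2; i <= NF; i++) {
            if ($i ~ /^SN:/) sn = substr($i, 4)
            if ($i ~ /^LN:/) ln = substr($i, 4)
        }
        print sn "\t" ln
    }'
}

bam=NA12878.chrom20.ILLUMINA.bwa.CEU.exome.20111114.bam
intervals=target_intervals.chr20.reduced.withhead.intervallist

# Only runs where samtools and both files are present (assumption: samtools is installed).
if command -v samtools >/dev/null 2>&1 && [ -f "$bam" ] && [ -f "$intervals" ]; then
    samtools view -H "$bam" | sq_pairs > bam.sq.txt      # hypothetical scratch file
    sq_pairs < "$intervals" > intervals.sq.txt           # hypothetical scratch file
    if diff bam.sq.txt intervals.sq.txt >/dev/null; then
        echo "BAM and interval list list the same sequences"
    else
        echo "Sequence names or lengths differ between BAM and interval list"
    fi
fi
```

The comparison only looks at sequence names and lengths, in order; it does not prove that the underlying FASTA is byte-identical.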

For our tutorial, use one of these BAM files:

Code Block
BAM files for exome capture evaluation tutorial

I've started with one of Illumina's target capture definitions (the vendor of your capture kit will provide this), but since the BAM files only contain chr20 data, I've also created a target definitions file restricted to chr20. Here they are:

Code Block
Two relevant target list definitions

And the relevant reference is:

Code Block
Reference for exome metrics


Code Block
This block will work if you have not completed the human trios tutorial
mkdir -p $SCRATCH/GVA_Exome_Capture
cd $SCRATCH/GVA_Exome_Capture
cp /corral-repl/utexas/BioITeam/ngs_course/human_variation/NA12878.chrom20.ILLUMINA.bwa.CEU.exome.20111114.bam .
cp /corral-repl/utexas/BioITeam/ngs_course/human_variation/NA12892.chrom20.ILLUMINA.bwa.CEU.exome.20111114.bam .
cp /corral-repl/utexas/BioITeam/ngs_course/human_variation/NA12891.chrom20.ILLUMINA.bwa.CEU.exome.20111114.bam .
cp /corral-repl/utexas/BioITeam/ngs_course/human_variation/target_intervals.chr20.reduced.withhead.intervallist .
cp /corral-repl/utexas/BioITeam/ngs_course/human_variation/target_intervals.reduced.withhead.intervallist .
cp /corral-repl/utexas/BioITeam/ngs_course/human_variation/ref/hs37d5.fa .
cp /corral-repl/utexas/BioITeam/ngs_course/human_variation/ref/hs37d5.fa.fai .

The run command looks long but, like most Java programs, isn't that complicated:

Code Block
How to run exactly these files on Lonestar
java -Xmx4g -jar /corral-repl/utexas/BioITeam/bin/picard.jar CollectHsMetrics \
  BAIT_INTERVALS=target_intervals.chr20.reduced.withhead.intervallist \
  TARGET_INTERVALS=target_intervals.chr20.reduced.withhead.intervallist \
  INPUT=NA12878.chrom20.ILLUMINA.bwa.CEU.exome.20111114.bam \
  REFERENCE_SEQUENCE=hs37d5.fa \
  OUTPUT=exome.picard.stats \
  PER_TARGET_COVERAGE=exome.pertarget.stats
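The summary file written to OUTPUT is a Picard metrics file: comment lines starting with '#', then a tab-separated header row followed by a row of values. As a rough sketch for pulling out single values, the helper below looks up one column by name. `hs_metric` is a hypothetical function name, and the column names used (MEAN_BAIT_COVERAGE, FOLD_ENRICHMENT, PCT_OFF_BAIT) are assumed to be present in your Picard version's output.

```shell
# Print the value of one named column from a Picard metrics file.
# $1 = metrics file, $2 = column name (prints nothing if the column is absent)
hs_metric() {
    awk -F'\t' -v want="$2" '
        /^#/ || /^$/ { next }                  # skip comments and blank lines
        !header_seen {                         # first remaining line is the header
            for (i = 1; i <= NF; i++) col[$i] = i
            header_seen = 1
            next
        }
        { if (want in col) print $(col[want]); exit }   # second is the value row
    ' "$1"
}

# Report a few metrics if the file from the run above exists.
if [ -f exome.picard.stats ]; then
    for m in MEAN_BAIT_COVERAGE FOLD_ENRICHMENT PCT_OFF_BAIT; do
        printf '%s\t%s\n' "$m" "$(hs_metric exome.picard.stats "$m")"
    done
fi
```

Only the first value row is read, so any histogram section that follows the metrics table is ignored.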


Since I don't actually know which capture kit was used to produce these libraries, these numbers may or may not accurately reflect how well the library prep went. Generally speaking, though, >40x average coverage on your baits (the target regions) is good, as is more than 500-fold enrichment. While it may be tempting to see 52% of reads being 'off bait' as a bad thing, consider instead that ~48% of reads mapped to just ~0.06% of the genome.
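To see why those two percentages imply strong enrichment, divide the fraction of reads on bait by the fraction of the genome that is baited. This is a back-of-the-envelope sketch using only the rounded figures quoted above; Picard reports its own FOLD_ENRICHMENT value in the metrics file, which may differ from this estimate.

```shell
# Rounded figures from the text: ~48% of reads on bait, baits cover ~0.06% of the genome.
on_bait_fraction=0.48
bait_genome_fraction=0.0006

# Enrichment relative to reads landing uniformly across the genome.
fold=$(awk -v on="$on_bait_fraction" -v bait="$bait_genome_fraction" \
    'BEGIN { printf "%.0f", on / bait }')
echo "approximate fold enrichment: $fold"
```

With these inputs the estimate comes out at roughly 800-fold, comfortably above the 500-fold rule of thumb.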

Additional Exercises:

These results were based on sample NA12878. How do the other two samples from the trios tutorial (NA12891 and NA12892) compare in their enrichment?
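As a starting point for this exercise, the sketch below prints one CollectHsMetrics command per sample instead of running it, so you can inspect the commands first. The per-sample output names (`<sample>.picard.stats`, `<sample>.pertarget.stats`) and the function name `hs_metrics_cmd` are assumptions made for illustration; the jar path and input files are the ones used earlier in this tutorial.

```shell
picard_jar=/corral-repl/utexas/BioITeam/bin/picard.jar
intervals=target_intervals.chr20.reduced.withhead.intervallist

# Print (not run) the CollectHsMetrics command for one sample.
hs_metrics_cmd() {
    sample=$1
    echo java -Xmx4g -jar "$picard_jar" CollectHsMetrics \
        BAIT_INTERVALS="$intervals" TARGET_INTERVALS="$intervals" \
        INPUT="$sample.chrom20.ILLUMINA.bwa.CEU.exome.20111114.bam" \
        REFERENCE_SEQUENCE=hs37d5.fa \
        OUTPUT="$sample.picard.stats" \
        PER_TARGET_COVERAGE="$sample.pertarget.stats"
}

for sample in NA12878 NA12891 NA12892; do
    hs_metrics_cmd "$sample"
done
```

Remove the `echo` inside the function once the printed commands look right, then compare the resulting metrics files across the three samples.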
