Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

Evaluating capture metrics

There are many ways to measure sequence capture.  You might care more about minimizing off-target capture, to make your sequencing dollars go as far as possible.  Or you might care more about maximizing on-target capture, to make sure you get data from every region of interest.  These two are usually negatively correlated.

Using Picard's "CalculateHsMetrics" function to evaluate capture

...

To run the program on Lonestar, there are three prerequisites: 1) A bam file and 2) a list of the genomic intervals that were to be captured and 3) the reference (.fa).  As you would guess, the BAM and interval list both have to be based on exactly the same genomic reference file.

For our tutorial, the bam files are one of these:

Code Block
titleBAM files for exome capture evaluation tutorial
/corral-repl/utexas/BioITeam/ngs_course/human_variation/NA12878.chrom20.ILLUMINA.bwa.CEU.exome.20111114.bam  
/corral-repl/utexas/BioITeam/ngs_course/human_variation/NA12892.chrom20.ILLUMINA.bwa.CEU.exome.20111114.bam
/corral-repl/utexas/BioITeam/ngs_course/human_variation/NA12891.chrom20.ILLUMINA.bwa.CEU.exome.20111114.bam

I've started with one of Illumina's target capture definitions (the vendor of your capture kit will provide this) but since the bam files only represent chr21 data I've created a target definitions file from chr21 only as well.  Here they are:

Code Block
titleTwo relevant target list definitions
/corral-repl/utexas/BioITeam/ngs_course/human_variation/target_intervals.chr20.reduced.withhead.intervallist
/corral-repl/utexas/BioITeam/ngs_course/human_variation/target_intervals.reduced.withhead.intervallist

And the relevant reference is:

Code Block
titleReference for exome metrics
/corral-repl/utexas/BioITeam/ngs_course/human_variation/ref/hs37d5.fa
/corral-repl/utexas/BioITeam/ngs_course/human_variation/ref/hs37d5.fa.fai
Code Block
languagebash
titleThis block will work on data you generated in the human trios analysis
collapsetrue
cds
mkdir BDIBGVA_Exome_Capture
cd BDIBGVA_Exome_Capture
cp $SCRATCH/BDIBGVA_Human_tutorial/raw_files/NA12878.chrom20.ILLUMINA.bwa.CEU.exome.20111114.bam .
cp $SCRATCH/BDIBGVA_Human_tutorial/raw_files/target_intervals.chr20.reduced.withhead.intervallist .
cp $SCRATCH/BDIBGVA_Human_tutorial/raw_files/ref/hs37d5.fa .
cp $SCRATCH/BDIBGVA_Human_tutorial/raw_files/ref/hs37d5.fa.fai .
Code Block
languagebash
titleThis block will work if you have not completed the human trios tutorial
cds
mkdir BDIBGVA_Exome_Capture
cd BDIBGVA_Exome_Capture
cp /corral-repl/utexas/BioITeam/ngs_course/human_variation/NA12878.chrom20.ILLUMINA.bwa.CEU.exome.20111114.bam .
cp /corral-repl/utexas/BioITeam/ngs_course/human_variation/NA12892.chrom20.ILLUMINA.bwa.CEU.exome.20111114.bam .
cp /corral-repl/utexas/BioITeam/ngs_course/human_variation/NA12891.chrom20.ILLUMINA.bwa.CEU.exome.20111114.bam .
cp /corral-repl/utexas/BioITeam/ngs_course/human_variation/target_intervals.chr20.reduced.withhead.intervallist .
cp /corral-repl/utexas/BioITeam/ngs_course/human_variation/target_intervals.reduced.withhead.intervallist .
cp /corral-repl/utexas/BioITeam/ngs_course/human_variation/ref/hs37d5.fa .
cp /corral-repl/utexas/BioITeam/ngs_course/human_variation/ref/hs37d5.fa.fai .

 


The run command looks long but isn't that complicated (like most java programs):

Code Block
titleHow to run exactly these files on Lonestar
module load picard-tools
module load java
java -Xmx4g -Djava.io.tmpdir=/tmp -jar /opt/apps/picard-tools/1.141/picard.jar CalculateHsMetrics BI=target_intervals.chr20.reduced.withhead.intervallist TI=target_intervals.chr20.reduced.withhead.intervallist I=NA12878.chrom20.ILLUMINA.bwa.CEU.exome.20111114.bam R=hs37d5.fa  O=exome.picard.stats PER_TARGET_COVERAGE=exome.pertarget.stats

 


The aggregate capture data is in exome.picard.stats, but it's format isn't very nice; here's a linux one-liner to reformat the two useful lines (one is the header, the other is the data) into columns, along with the result:

...

Tip
titleTaking the output even further

It is rare that you ever want to work with a single sample. While this format is nice for a single sample, comparing the same data across multiple samples would not be the easiest to do with this format. Instead, putting this information to a file, then using grep and awk you could make a small table of the specific information you want.

...


Since I don't actually know what capture kit was used to produce these libraries, these may or may not accurately reflect how well the library prep went, but generally speaking having >40x average coverage on your baits (the target regions) is good, as is over 500 fold enrichment. While it may be tempting to consider 52% of reads being 'off bait' as a bad thing, instead consider that ~48% of reads mapped to just ~0.06% of the genome.

...

These results were based on sample NA12878. How do the other 2 samples (NA12891, and NA12892) from the trios tutorial compare for their enrichment? 


Return to GVA2017