Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

Table of Contents

Note
titleData used in this tutorial

Recommended but not required that you first complete the trios tutorial and use the data you generated here. Alternatively, canned data provided.

...

As we discussed during some of our variant calling, the majority of the reads, and bases within reads, will map perfectly with the reference genome (unless you are using a reference genome that needs some improvement in which case, you should work towards that end possibly with the spades or novel DNA tutorials as command line jump offs. As so much of the reads are just telling you what you already know, it is possible to design your experiments to target reads to a given location or sets of locations in the genome via sequence capture. One of the most common things to target is the exome, or a portion of the exome as mutations in exons are more likely to have a functional consequence than mutations in introns or intergenic regions.

...

Picard is another tool like like gatk also put out by the broad. As of gatk version 4.0, gatk began bundling picard with all of its distributions. This means if you have already done the gatk tutorial you already have access to picard! Alternatively, as picard is just a subset of the gatk package, it may be of interest for you to only install picard tools rather than the full suite. Further, conda offers a "picard-slim" packaging that includes most of the picard tools, but not those few that would otherwise require an associated R program.

...

  1. If you wish to install picard as part of the gatk package (allows you to access commands via gatk toolname)

    Code Block
    languagebash
    titleAs was done in our gatk tutorial, the commands for putting gatk in its own environment and activating it are below. The gatk tutorial contains slightly more information about this installation if you are interested.
    conda create --name GVA-gatk -c conda-forge -c bioconda gatk4=4.4.0
    conda activate GVA-gatk
    gatk --version


    No Format
    The Genome Analysis Toolkit (GATK) v4.24.60.10
    HTSJDK Version: 23.240.15
    Picard Version: 23.270.10



  2. If you wish to install picard as a stand alone package (allows you to access commands via picard toolname). Since the only reason that jumps out at me to do this is to not "clutter" things, we'll go with the "slim" package, though changing to the full picard package would not be difficult.

    Code Block
    languagebash
    conda create --name GVA-picard -c conda-forge -c bioconda picard-slim
    conda activate GVA-picard
    picard --version

    The version command above will actually print the picard help file (aka list of all picard tools). When conda was downloading the package, it appears we are getting version 23.270.30, but I don't see any way to directly access picard's installed version from the command line, though individual tool's version can be access via picard ToolName --version. The listing of all of picard's tools tells us that we have successfully installed picard though.

...

To run CollectHsMetrics, there are three prerequisites: 1) A bam file and 2) a list of the genomic intervals that were to be captured and 3) the reference (.fa).  As you would guess, the BAM and interval list both have to be based on exactly the same genomic reference file.

For our tutorial, the bam files are one of these:

Code Block
titleBAM files for exome capture evaluation tutorial
/corral-repl/utexas/BioITeam/ngs_course/human_variation/NA12878.chrom20.ILLUMINA.bwa.CEU.exome.20111114.bam  
/corral-repl/utexas/BioITeam/ngs_course/human_variation/NA12892.chrom20.ILLUMINA.bwa.CEU.exome.20111114.bam
/corral-repl/utexas/BioITeam/ngs_course/human_variation/NA12891.chrom20.ILLUMINA.bwa.CEU.exome.20111114.bam

I've started with one of Illumina's target capture definitions (the vendor of your capture kit will provide this) but since the bam files only represent chr20 data I've created a target definitions file from chr20 only as well.  Here they are:

Code Block
titleTwo relevant target list definitions
/corral-repl/utexas/BioITeam/ngs_course/human_variation/target_intervals.chr20.reduced.withhead.intervallist
/corral-repl/utexas/BioITeam/ngs_course/human_variation/target_intervals.reduced.withhead.intervallist

And the relevant reference is:

Code Block
titleReference for exome metrics
/corral-repl/utexas/BioITeam/ngs_course/human_variation/ref/hs37d5.fa
/corral-repl/utexas/BioITeam/ngs_course/human_variation/ref/hs37d5.fa.fai


Note
titlePossible errors on idev nodes

As mentioned yesterday, you can not copy from the BioITeam (because it is on corral-repl) while on an idev node. Logout of your idev session, copy the files.


Code Block
languagebash
titleThis block will work on data you used in the human trios analysis
mkdir $SCRATCH/GVA_Exome_Capture
cd $SCRATCH/GVA_Exome_Capture
cp $SCRATCH/GVA_Human_trios/raw_files/NA12878.chrom20.ILLUMINA.bwa.CEU.exome.20111114.bam .
cp $SCRATCH/GVA_Human_trios/raw_files/target_intervals.chr20.reduced.withhead.intervallist .
cp $SCRATCH/GVA_Human_trios/raw_files/ref/hs37d5.fa .
cp $SCRATCH/GVA_Human_trios/raw_files/ref/hs37d5.fa.fai .

...

Code Block
languagebash
titleThis block will work if you have not completed the human trios tutorial
collapsetrue
mkdir $SCRATCH/GVA_Exome_Capture
cd $SCRATCH/GVA_Exome_Capture
cp /corral-repl/utexas/BioITeam/ngs_course/human_variation/NA12878.chrom20.ILLUMINA.bwa.CEU.exome.20111114.bam .
cp /corral-repl/utexas/BioITeam/ngs_course/human_variation/NA12892.chrom20.ILLUMINA.bwa.CEU.exome.20111114.bam .
cp /corral-repl/utexas/BioITeam/ngs_course/human_variation/NA12891.chrom20.ILLUMINA.bwa.CEU.exome.20111114.bam .
cp /corral-repl/utexas/BioITeam/ngs_course/human_variation/target_intervals.chr20.reduced.withhead.intervallist .
cp /corral-repl/utexas/BioITeam/ngs_course/human_variation/target_intervals.reduced.withhead.intervallist .
cp /corral-repl/utexas/BioITeam/ngs_course/human_variation/ref/hs37d5.fa .
cp /corral-repl/utexas/BioITeam/ngs_course/human_variation/ref/hs37d5.fa.fai .


Using Picard's "CollectHsMetrics" function to evaluate capture

...

Warning
titleAs with other gatk based commands we have run, these commands while quick (2-3 minutes) will not finish on the head node due to memory problems

Make sure you are on an idev node with hostname or showq -u commands. Additional information for launching an idev node is found here.


The following command is rather large, note that there are a total of 4 commands in the next window: 2 corresponding to loading the correct environment on the compute node, and 2 corresponding to CollectHsMetrics commands. Make sure your copy paste does not introduce additional line returns.

Code Block
languagebash
titleUse one of the following 2 sets of commands depending on which installation you chose
linenumberstrue
conda activate GVA-gatk
gatk CollectHsMetrics BAIT_INTERVALS=target_intervals.chr20.reduced.withhead.intervallist TARGET_INTERVALS=target_intervals.chr20.reduced.withhead.intervallist INPUT=NA12878.chrom20.ILLUMINA.bwa.CEU.exome.20111114.bam REFERENCE_SEQUENCE=hs37d5.fa  OUTPUT=exome.picard.stats PER_TARGET_COVERAGE=exome.pertarget.stats
#
conda activate GVA-picard
picard CollectHsMetrics BAIT_INTERVALS=target_intervals.chr20.reduced.withhead.intervallist TARGET_INTERVALS=target_intervals.chr20.reduced.withhead.intervallist INPUT=NA12878.chrom20.ILLUMINA.bwa.CEU.exome.20111114.bam REFERENCE_SEQUENCE=hs37d5.fa OUTPUT=exome.picard.stats PER_TARGET_COVERAGE=exome.pertarget.stats

Both commands will produce identical results, and nearly identical output to the screen when run (provided you have installed things correctly). Your output is expected to be similar to this which was observed last year:

No Format
INFO	2022-06-22 23:04:42	CollectHsMetrics	Processed     1,000,000 records.  Elapsed time: 00:00:53s.  Time for last 1,000,000:   53s.  Last read position: 20:14,922,373
INFO	2022-06-22 23:05:31	CollectHsMetrics	Processed     2,000,000 records.  Elapsed time: 00:01:43s.  Time for last 1,000,000:   49s.  Last read position: 20:32,404,938
INFO	2022-06-22 23:06:24	CollectHsMetrics	Processed     3,000,000 records.  Elapsed time: 00:02:36s.  Time for last 1,000,000:   52s.  Last read position: 20:43,530,401
INFO	2022-06-22 23:07:16	CollectHsMetrics	Processed     4,000,000 records.  Elapsed time: 00:03:27s.  Time for last 1,000,000:   51s.  Last read position: 20:56,062,712
INFO	2022-06-22 23:07:40	TheoreticalSensitivity	Creating Roulette Wheel
INFO	2022-06-22 23:07:41	TheoreticalSensitivity	Calculating quality sums from quality sampler
INFO	2022-06-22 23:07:41	TheoreticalSensitivity	0 sampling iterations completed
INFO	2022-06-22 23:07:41	TheoreticalSensitivity	1000 sampling iterations completed
INFO	2022-06-22 23:07:41	TheoreticalSensitivity	2000 sampling iterations completed
INFO	2022-06-22 23:07:42	TheoreticalSensitivity	3000 sampling iterations completed
INFO	2022-06-22 23:07:42	TheoreticalSensitivity	4000 sampling iterations completed
INFO	2022-06-22 23:07:43	TheoreticalSensitivity	5000 sampling iterations completed
INFO	2022-06-22 23:07:43	TheoreticalSensitivity	6000 sampling iterations completed
INFO	2022-06-22 23:07:44	TheoreticalSensitivity	7000 sampling iterations completed
INFO	2022-06-22 23:07:44	TheoreticalSensitivity	8000 sampling iterations completed
INFO	2022-06-22 23:07:45	TheoreticalSensitivity	9000 sampling iterations completed
INFO	2022-06-22 23:07:45	TheoreticalSensitivity	Calculating theoretical het sensitivity
INFO	2022-06-22 23:07:46	TargetMetricsCollector	Calculating GC metrics
[Wed Jun 22 23:07:47 CDT 2022] picard.analysis.directed.CollectHsMetrics done. Elapsed time: 4.09 minutes.
Runtime.totalMemory()=1581252608
Tool returned:
0

...

Here is the output of the above command for a 2023 run, DO NOT! paste this into the command line.

No Format
BAIT_SET	target_intervals
BAIT_TERRITORY	1843371
BAIT_DESIGN_EFFICIENCY	1
ON_BAIT_BASES	94769476
NEAR_BAIT_BASES	54852614
OFF_BAIT_BASES	160815315
PCT_SELECTED_BASES	0.481972
PCT_OFF_BAIT	0.518028
ON_BAIT_VS_SELECTED	0.633392
MEAN_BAIT_COVERAGE	51.410962
PCT_USABLE_BASES_ON_BAIT	0.272266
PCT_USABLE_BASES_ON_TARGET	0.212359
FOLD_ENRICHMENT	519.58801
HS_LIBRARY_SIZE	5325189
HS_PENALTY_10X	225.120249
HS_PENALTY_20X	-1
HS_PENALTY_30X	-1
HS_PENALTY_40X	-1
HS_PENALTY_50X	-1
HS_PENALTY_100X	-1
TARGET_TERRITORY	1843371
GENOME_SIZE	3137454505
TOTAL_READS	4579959
PF_READS	4579959
PF_BASES	348076884
PF_UNIQUE_READS	4208881
PF_UQ_READS_ALIGNED	4161913
PF_BASES_ALIGNED	310437405
PF_UQ_BASES_ALIGNED	286584646
ON_TARGET_BASES	73917336
PCT_PF_READS	1
PCT_PF_UQ_READS	0.918978
PCT_PF_UQ_READS_ALIGNED	0.988841
MEAN_TARGET_COVERAGE	40.099001
MEDIAN_TARGET_COVERAGE	7
MAX_TARGET_COVERAGE	1293
MIN_TARGET_COVERAGE	0
ZERO_CVG_TARGETS_PCT	0.005348
PCT_EXC_DUPE	0.076836
PCT_EXC_ADAPTER	0
PCT_EXC_MAPQ	0.018709
PCT_EXC_BASEQ	0.04137
PCT_EXC_OVERLAP	0.084969
PCT_EXC_OFF_TARGET	0.540013
FOLD_80_BASE_PENALTY	20.049501
PCT_TARGET_BASES_1X	0.924958
PCT_TARGET_BASES_2X	0.820112
PCT_TARGET_BASES_10X	0.473017
PCT_TARGET_BASES_20X	0.423769
PCT_TARGET_BASES_30X	0.385684
PCT_TARGET_BASES_40X	0.34762
PCT_TARGET_BASES_50X	0.308718
PCT_TARGET_BASES_100X	0.137769
PCT_TARGET_BASES_250X	0.011475
PCT_TARGET_BASES_500X	0.000228
PCT_TARGET_BASES_1000X	0.000022
PCT_TARGET_BASES_2500X	0
PCT_TARGET_BASES_5000X	0
PCT_TARGET_BASES_10000X	0
PCT_TARGET_BASES_25000X	0
PCT_TARGET_BASES_50000X	0
PCT_TARGET_BASES_100000X	0
AT_DROPOUT	2.023144
GC_DROPOUT	10.378868
HET_SNP_SENSITIVITY	0.720608
HET_SNP_Q	6
SAMPLE	
LIBRARY	
READ_GROUP	


Tip
titleTaking the output even further

It is rare that you ever want to work with a single sample. While this format is nice for a single sample, comparing the same data across multiple samples would not be the easiest to do with this format. Instead, putting this information to a file, then using grep and awk you could make a small table of the specific information you want.

...

These results were based on sample NA12878. How do the other 2 samples (NA12891, and NA12892) from the trios tutorial compare for their enrichment? If you are unsure how differences in enrichment could effect analysis, this is a great discussion topic.


Return to GVA2022GVA2023 course page.