Diploid genomes
The initial steps in calling variants for diploid or multi-ploid organisms with NGS data are the same as what we've already seen:
...
Trio (or familial) analysis has been exceptionally powerful for identifying rare childhood diseases. The most prominent publication in this area is this first example of whole exome sequencing saving a life. There are many other publications since and some review articles such as this one. Familial analysis is critical for rare, autosomal dominant diseases because, almost by definition, the mutations may be "private" to each individual so we can't look across big populations to find one single causative mutation. But within families, we can identify bad private mutations in single genes or pathways and then look across populations to find commonality at the gene or pathway level to explain a phenotype.
Example: The CEU Trio from the 1000 Genomes Project
Many example datasets are available from the 1000 genomes project specifically for method evaluation and training. We'll explore a trio (mom, dad, child). Their accession numbers are NA12892, NA12891, and NA12878 respectively. To make the exercise run more quickly, we'll focus on data only from chromosome 20.
...
We'll return to this example data later to demonstrate a much more involved tool, GATK, to do the same steps in another tutorial.
Single-sample variant calling with samtools
We would normally use the BAM file from a previous mapping step to call variants in this raw data. However, for the purposes of this course we will use the actual BAM file provided by the 1000 Genomes Project (from which the .fastq
file above was derived, leading to some oddities in it). As a bonus tutorial, you could map the data yourself and using what you learned in the bowtie2 tutorial and then use the resultant .bam files.
...
Warning | |||||
---|---|---|---|---|---|
| |||||
This command will take quite a bit of time to complete. While it is running on an idev node, see if you can figure out what each of the options in the mpileup and bcftools commands are doing
|
After your single-sample variant calling job completes
Keep in mind that variant files only record variation that can be seen with the data provided. Where ever sample sequence exactly matches the reference (i.e. is homozygous wildtype relative to the reference) there will be no data. Which looks the same as if you had no data in those regions; this leads us to our next topic.
Multiple-sample variant calling with samtools
This is all fine, but if you're actually trying to study human (or other organism) genetics, you must discriminate homozygous WT from a lack of data. This is done by providing many samples to the variant caller simultaneously. This concept extends further to populations; calling variants across a large and diverse population provides a stronger Bayesian prior probability distribution for more sensitive detection.
...
This same type of analysis can be done on much larger cohorts, say cohorts of 100 or even 1000s of individuals with known disease state to attempt to identify associations between allelic state and disease state.
Warning | ||
---|---|---|
| ||
STOP HERE UNTIL FIXED |
A bcftools
auxiliary perl script called vcfutils.pl
can provide some useful stats. This is not part of a standard TACC module but it's in our common $BI/bin directory.
Code Block | ||
---|---|---|
| ||
vcfutils.pl qstats trios_tutorial.raw.vcf
|
No Format | ||||
---|---|---|---|---|
| ||||
tacc:/scratch/01821/ded/BDIB_Human_tutorial/samtools_example$ vcfutils.pl qstats trios_tutorial.raw.vcf
186 1011 1011 757 0 2.9803 0.0000 0.0000 2.9803
142 2036 2036 1441 0 2.4218 0.0000 0.0000 2.0059
122 3091 3091 2156 0 2.3059 0.0000 0.0000 2.1029
109 4177 4177 2925 0 2.3363 0.0000 0.0000 2.4259
99.5 5237 5237 3647 0 2.2937 0.0000 0.0000 2.1361
92 6273 6273 4354 0 2.2689 0.0000 0.0000 2.1489
85.5 7328 7328 5105 0 2.2964 0.0000 0.0000 2.4704
79 8352 8352 5811 0 2.2869 0.0000 0.0000 2.2201
74 9369 9369 6497 0 2.2622 0.0000 0.0000 2.0725
69 10553 10553 7311 0 2.2551 0.0000 0.0000 2.2000
65 11782 11782 8176 0 2.2673 0.0000 0.0000 2.3764
61 13059 13059 9090 0 2.2902 0.0000 0.0000 2.5179
57 14280 14280 9945 0 2.2941 0.0000 0.0000 2.3361
53 15301 15301 10656 0 2.2941 0.0000 0.0000 2.2935
48.8 16323 16323 11347 0 2.2803 0.0000 0.0000 2.0876
45 17460 17460 12122 0 2.2709 0.0000 0.0000 2.1409
41.8 18639 18639 12899 0 2.2472 0.0000 0.0000 1.9328
39.8 19660 19660 13572 0 2.2293 0.0000 0.0000 1.9339
37.8 21013 21013 14496 0 2.2243 0.0000 0.0000 2.1538
35.8 22309 22309 15380 0 2.2197 0.0000 0.0000 2.1456
33.8 23568 23568 16251 0 2.2210 0.0000 0.0000 2.2448
31.8 24846 24846 17101 0 2.2080 0.0000 0.0000 1.9860
29.8 26171 26171 17975 0 2.1931 0.0000 0.0000 1.9379
28 27308 27308 18688 0 2.1680 0.0000 0.0000 1.6816
26 28654 28654 19551 0 2.1478 0.0000 0.0000 1.7867
24 29947 29947 20371 0 2.1273 0.0000 0.0000 1.7336
22 31050 31050 21065 0 2.1097 0.0000 0.0000 1.6968
20 32081 32081 21719 0 2.0960 0.0000 0.0000 1.7347
17.1 33251 33251 22446 0 2.0774 0.0000 0.0000 1.6411
13.2 34447 34447 23238 0 2.0732 0.0000 0.0000 1.9604
10.4 35751 35751 24067 0 2.0598 0.0000 0.0000 1.7453
9.53 36772 36772 24724 0 2.0521 0.0000 0.0000 1.8049
8.65 38023 38023 25539 0 2.0457 0.0000 0.0000 1.8693
7.8 39301 39301 26360 0 2.0369 0.0000 0.0000 1.7965
6.98 40665 40665 27196 0 2.0192 0.0000 0.0000 1.5833
6.2 42394 42394 28243 0 1.9958 0.0000 0.0000 1.5352
5.46 44018 44018 29211 0 1.9728 0.0000 0.0000 1.4756
4.77 45553 45553 30168 0 1.9609 0.0000 0.0000 1.6557
4.13 47020 47020 31081 0 1.9500 0.0000 0.0000 1.6480
3.54 48572 48572 32063 0 1.9422 0.0000 0.0000 1.7228
3.01 50463 50463 33139 0 1.9129 0.0000 0.0000 1.3202
|
Undoubtedly, you this looks like nothing but gibberish to you and understandably so, a tab delimited output without headers is rarely if ever useful. It is unclear exactly what the column headings are
Here is an honest write-up about the Ts/Tv metric, referencing this 2002 paper by Ebersberger. Bottom line: between about 2.1 and 2.8 is OK for Ts/Tv (also called Ti/Tv). You can also get some quick stats with some linux one-liners on this page; there are more thorough analysis programs built to work with vcf's.