Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

Diploid genomes

The initial steps in calling variants for diploid or multi-ploid organisms with NGS data are the same as what we've already seen:

...

Trio (or familial) analysis has been exceptionally powerful for identifying rare childhood diseases.  The most prominent publication in this area is this first example of whole exome sequencing saving a life.  There are many other publications since and some review articles such as this one.  Familial analysis is critical for rare, autosomal dominant diseases because, almost by definition, the mutations may be "private" to each individual so we can't look across big populations to find one single causative mutation.  But within families, we can identify bad private mutations in single genes or pathways and then look across populations to find commonality at the gene or pathway level to explain a phenotype.

Example: The CEU Trio from the 1000 Genomes Project

Many example datasets are available from the 1000 genomes project specifically for method evaluation and training. We'll explore a trio (mom, dad, child). Their accession numbers are NA12892, NA12891, and NA12878 respectively. To make the exercise run more quickly, we'll focus on data only from chromosome 20.

...

We'll return to this example data later to demonstrate a much more involved tool, GATK, to do the same steps in another tutorial.

Single-sample variant calling with samtools

We would normally use the BAM file from a previous mapping step to call variants in this raw data. However, for the purposes of this course we will use the actual BAM file provided by the 1000 Genomes Project (from which the .fastq file above was derived, leading to some oddities in it). As a bonus tutorial, you could map the data yourself and using what you learned in the bowtie2 tutorial and then use the resultant .bam files.

...

Warning
titleMake sure you are on an idev node, or submit as job

This command will take quite a bit of time to complete. While it is running on an idev node, see if you can figure out what each of the options in the mpileup and bcftools commands are doing

Code Block
titleCalling variants using samtools and bcftools
cds 
cd BDIB_Human_tutorial
mkdir samtools_example
cd samtools_example
module unload samtools
samtools mpileup -uf $SCRATCH/BDIB_Human_tutorial/raw_files/ref/hs37d5.fa $SCRATCH/BDIB_Human_tutorial/raw_files/NA12878.chrom20.ILLUMINA.bwa.CEU.exome.20111114.bam | bcftools view -vcg - > trios_tutorial.raw.vcf

 

After your single-sample variant calling job completes

 

Keep in mind that variant files only record variation that can be seen with the data provided. Where ever sample sequence exactly matches the reference (i.e. is homozygous wildtype relative to the reference) there will be no data. Which looks the same as if you had no data in those regions; this leads us to our next topic.

Multiple-sample variant calling with samtools

This is all fine, but if you're actually trying to study human (or other organism) genetics, you must discriminate homozygous WT from a lack of data. This is done by providing many samples to the variant caller simultaneously. This concept extends further to populations; calling variants across a large and diverse population provides a stronger Bayesian prior probability distribution for more sensitive detection.

...

This same type of analysis can be done on much larger cohorts, say cohorts of 100 or even 1000s of individuals with known disease state to attempt to identify associations between allelic state and disease state.

 

 

 

 

Warning
titleThis seems broken and confused

STOP HERE UNTIL FIXED

 

 

bcftools auxiliary perl script called vcfutils.pl can provide some useful stats. This is not part of a standard TACC module but it's in our common $BI/bin directory.

Code Block
titleRunning vcfutils.pl
vcfutils.pl qstats trios_tutorial.raw.vcf
No Format
titleExpected output of vcfutils.pl script
collapsetrue
tacc:/scratch/01821/ded/BDIB_Human_tutorial/samtools_example$ vcfutils.pl qstats trios_tutorial.raw.vcf
186	1011	1011	757	0	2.9803	0.0000	0.0000	2.9803
142	2036	2036	1441	0	2.4218	0.0000	0.0000	2.0059
122	3091	3091	2156	0	2.3059	0.0000	0.0000	2.1029
109	4177	4177	2925	0	2.3363	0.0000	0.0000	2.4259
99.5	5237	5237	3647	0	2.2937	0.0000	0.0000	2.1361
92	6273	6273	4354	0	2.2689	0.0000	0.0000	2.1489
85.5	7328	7328	5105	0	2.2964	0.0000	0.0000	2.4704
79	8352	8352	5811	0	2.2869	0.0000	0.0000	2.2201
74	9369	9369	6497	0	2.2622	0.0000	0.0000	2.0725
69	10553	10553	7311	0	2.2551	0.0000	0.0000	2.2000
65	11782	11782	8176	0	2.2673	0.0000	0.0000	2.3764
61	13059	13059	9090	0	2.2902	0.0000	0.0000	2.5179
57	14280	14280	9945	0	2.2941	0.0000	0.0000	2.3361
53	15301	15301	10656	0	2.2941	0.0000	0.0000	2.2935
48.8	16323	16323	11347	0	2.2803	0.0000	0.0000	2.0876
45	17460	17460	12122	0	2.2709	0.0000	0.0000	2.1409
41.8	18639	18639	12899	0	2.2472	0.0000	0.0000	1.9328
39.8	19660	19660	13572	0	2.2293	0.0000	0.0000	1.9339
37.8	21013	21013	14496	0	2.2243	0.0000	0.0000	2.1538
35.8	22309	22309	15380	0	2.2197	0.0000	0.0000	2.1456
33.8	23568	23568	16251	0	2.2210	0.0000	0.0000	2.2448
31.8	24846	24846	17101	0	2.2080	0.0000	0.0000	1.9860
29.8	26171	26171	17975	0	2.1931	0.0000	0.0000	1.9379
28	27308	27308	18688	0	2.1680	0.0000	0.0000	1.6816
26	28654	28654	19551	0	2.1478	0.0000	0.0000	1.7867
24	29947	29947	20371	0	2.1273	0.0000	0.0000	1.7336
22	31050	31050	21065	0	2.1097	0.0000	0.0000	1.6968
20	32081	32081	21719	0	2.0960	0.0000	0.0000	1.7347
17.1	33251	33251	22446	0	2.0774	0.0000	0.0000	1.6411
13.2	34447	34447	23238	0	2.0732	0.0000	0.0000	1.9604
10.4	35751	35751	24067	0	2.0598	0.0000	0.0000	1.7453
9.53	36772	36772	24724	0	2.0521	0.0000	0.0000	1.8049
8.65	38023	38023	25539	0	2.0457	0.0000	0.0000	1.8693
7.8	39301	39301	26360	0	2.0369	0.0000	0.0000	1.7965
6.98	40665	40665	27196	0	2.0192	0.0000	0.0000	1.5833
6.2	42394	42394	28243	0	1.9958	0.0000	0.0000	1.5352
5.46	44018	44018	29211	0	1.9728	0.0000	0.0000	1.4756
4.77	45553	45553	30168	0	1.9609	0.0000	0.0000	1.6557
4.13	47020	47020	31081	0	1.9500	0.0000	0.0000	1.6480
3.54	48572	48572	32063	0	1.9422	0.0000	0.0000	1.7228
3.01	50463	50463	33139	0	1.9129	0.0000	0.0000	1.3202

Undoubtedly, you this looks like nothing but gibberish to you and understandably so, a tab delimited output without headers is rarely if ever useful. It is unclear exactly what the column headings are

 

Here is an honest write-up about the Ts/Tv metricreferencing this 2002 paper by Ebersberger.  Bottom line: between about 2.1 and 2.8 is OK for Ts/Tv (also called Ti/Tv). You can also get some quick stats with some linux one-liners on this page; there are more thorough analysis programs built to work with vcf's.