Diploid genomes
The initial steps in calling variants for diploid or multi-ploid organisms with NGS data are the same as what we've already seen:
Expectations of output are quite different however, which can add statistical power to uncovering variation in populations or organisms with more than two expected variants at the same location. If you happen to be working with a model organism with extensive external data (ESPECIALLY HUMAN), then there are even more sophisticated tools like the Broad Institute's GATK that can improve both sensitivity and specificity of your variant calls.
Trio (or familial) analysis has been exceptionally powerful for identifying rare childhood diseases. The most prominent publication in this area is this first example of whole exome sequencing saving a life. There are many other publications since and some review articles such as this one. Familial analysis is critical for rare, autosomal dominant diseases because, almost by definition, the mutations may be "private" to each individual so we can't look across big populations to find one single causative mutation. But within families, we can identify bad private mutations in single genes or pathways and then look across populations to find commonality at the gene or pathway level to explain a phenotype.
Example: The CEU Trio from the 1000 Genomes Project
Many example datasets are available from the 1000 genomes project specifically for method evaluation and training. We'll explore a trio (mom, dad, child). Their accession numbers are NA12892, NA12891, and NA12878 respectively. To make the exercise run more quickly, we'll focus on data only from chromosome 20.
...
Code Block | ||
---|---|---|
| ||
cds mkdir BDIBGVA_Human_tutorialtrios cd BDIBGVA_Human_tutorialtrios cp -r $BI/ngs_course/human_variation raw_files |
...
- compressed raw data (the .fastq.gz files)
- mapped data (the .bam files)
- variant calls (the .vcf files)
- the subdirectory
ref
with special references - .bam files containing a subset of mapped human whole exome data are also available on these three; those are the three files "NA*.bam".
- We've pre-run samtools and GATK on each sample individually - those are the *GATK.vcf and *samtools.vcf files.
- We've also pre-run samtools and GATK on the trio, resulting in GATK.all.vcf and samtools.all.vcf.
- The 1000 Genomes project is really oriented to producing .vcf files; the file "ceu20.vcf" contains all the latest genotypes from this trio based on abundant data from the project.
Here we show the steps for:
- Mapping
- Single-sample variant calling with samtools
- Multiple-sample variant calling with samtools
Single-sample variant calling with samtools
We would normally use the BAM file from a previous mapping step to call variants in this raw data. However, for the purposes of this course we will use the actual BAM file provided by the 1000 Genomes Project (from which the .fastq
file above was derived, leading to some oddities in it). As a bonus tutorial, you could map the data yourself and using what you learned in the bowtie2 tutorial and then use the resultant .bam files.
For now, the bam file we want to focus on is:
$SCRATCH/BDIBGVA_Human_tutorialtrios/raw_files/NA12878.chrom20.ILLUMINA.bwa.CEU.exome.20111114.bam
...
Warning | |||||
---|---|---|---|---|---|
| |||||
This command will take quite a bit of time to complete. While it is running on an idev node, see if you can figure out what each of the options in the mpileup and bcftools commands are doing
|
After your single-sample variant calling job completes
Keep in mind that variant files only record variation that can be seen with the data provided. Where ever sample sequence exactly matches the reference (i.e. is homozygous wildtype relative to the reference) there will be no data. Which looks the same as if you had no data in those regions; this leads us to our next topic.
Multiple-sample variant calling with samtools
This is all fine, but if you're actually trying to study human (or other organism) genetics, you must discriminate homozygous WT from a lack of data. This is done by providing many samples to the variant caller simultaneously. This concept extends further to populations; calling variants across a large and diverse population provides a stronger Bayesian prior probability distribution for more sensitive detection.
...
By providing separate bam files for each sample. Note that in the following code block, he trailing \ symbols tells the command line that you are not done entering the command and not to execute any commands until you hit return without a preceding \ mark.
Warning title Warning, DO NOT run this command yet. Code Block title samtools multi-sample variants: separate bam files samtools mpileup -uf $SCRATCH/BDIBGVA_Human_tutorialtrios/raw_files/ref/hs37d5.fa \ $SCRATCH/BDIBGVA_Human_tutorialtrios/raw_files/NA12878.chrom20.ILLUMINA.bwa.CEU.exome.20111114.bam \ $SCRATCH/BDIBGVA_Human_tutorialtrios/raw_files/NA12891.chrom20.ILLUMINA.bwa.CEU.exome.20111114.bam \ $SCRATCH/BDIBGVA_Human_tutorialtrios/raw_files/NA12892.chrom20.ILLUMINA.bwa.CEU.exome.20111114.bam \ | bcftools call -v -c - > multi_sample/trios_tutorial.all.samtools.vcf
The output file from this option is in$BI/ngs_course/human_variation as all.samtools.vcf
By providing one or more bam files, each containing mapped reads from multiple samples tagged with unique samtools
@RG
tags.Warning title Do not run this command This command comes with a sizable caveat: If you intend to use this option, you must make sure you tag your reads with the right RG tag; this can easily be done during the
samse
orsampe
stage of mapping with bwa with the-r
option, usingsamtools merge
, with picard tools AddOrReplaceReadGroup command, or with your own perl/python/bash commands. The problem is that it requires prior knowledge of all the samples you are going to group together ahead of time, and individual sample input files have to be merged together. There are obviously times that this will be the desired and easier, but for the purpose of this course, we believe it makes more sense to keep all input files separate and only have a single merged output file. As part of the tutorial we do provide the appropriate files necessary for the following command to run.Code Block title samtools multi-sample variants: one or more bam files using @RG samtools mpileup -u -f $SCRATCH/BDIBGVA_Human_tutorialtrios/raw_files/ref/hs37d5.fa all.bam | bcftools call -v -c - > trios_tutorial.all.raw.vcf
...
Code Block | ||
---|---|---|
| ||
cds cd BDIBGVA_Human_tutorialtrios mkdir multi_sample samtools mpileup -uf $SCRATCH/BDIBGVA_Human_tutorialtrios/raw_files/ref/hs37d5.fa \ $SCRATCH/BDIBGVA_Human_tutorialtrios/raw_files/NA12878.chrom20.ILLUMINA.bwa.CEU.exome.20111114.bam \ $SCRATCH/BDIBGVA_Human_tutorialtrios/raw_files/NA12891.chrom20.ILLUMINA.bwa.CEU.exome.20111114.bam \ $SCRATCH/BDIBGVA_Human_tutorialtrios/raw_files/NA12892.chrom20.ILLUMINA.bwa.CEU.exome.20111114.bam \ | bcftools call -v -c - > multi_sample/trios_tutorial.all.samtools.vcf |
Identify the lineage
If genetics works, you should be able to identify the child based strictly on the genotypes. Can you do it?
...
Expand | ||
---|---|---|
| ||
Overall this data is consistent with column 1 (NA12878) being the child. Lines marked with an * are inconsistant: 34 0/1 0/1 0/0 20 0/1 0/0 0/1 20 0/0 0/1 0/0 14 0/1 1/1 0/0 # middle can't be child 6 0/0 0/1 0/1 4 1/1 1/1 0/0 * No mendelian combination exists 1 1/1 0/1 0/0 * 1 0/1 0/0 0/0 * This is, in fact, the correct assessment - NA12878 is the child. |
This same type of analysis can be done on much larger cohorts, say cohorts of 100 or even 1000s of individuals with known disease state to attempt to identify associations between allelic state and disease state.