Diploid genomes

The initial steps in calling variants for diploid or polyploid organisms with NGS data are the same as those we've already seen.


Trio (or familial) analysis has been exceptionally powerful for identifying the genetic causes of rare childhood diseases. The most prominent publication in this area is this first example of whole exome sequencing saving a life. There have been many other publications since, along with review articles such as this one. Familial analysis is critical for rare, autosomal dominant diseases because, almost by definition, the mutations may be "private" to each individual, so we can't look across large populations to find a single causative mutation. Within families, however, we can identify deleterious private mutations in single genes or pathways, and then look across populations for commonality at the gene or pathway level to explain a phenotype.

Example: The CEU Trio from the 1000 Genomes Project

Many example datasets are available from the 1000 Genomes Project specifically for method evaluation and training. We'll explore a trio (mom, dad, child). Their accession numbers are NA12892, NA12891, and NA12878, respectively. To make the exercise run more quickly, we'll focus only on data from chromosome 20.


We'll return to this example data in another tutorial to demonstrate a much more involved tool, GATK, that performs the same steps.

Single-sample variant calling with samtools

We would normally call variants using the BAM file from a previous mapping step. However, for the purposes of this course we will use the BAM file provided by the 1000 Genomes Project (from which the .fastq file above was derived, leading to some oddities in it). As a bonus exercise, you could map the data yourself using what you learned in the bowtie2 tutorial and then use the resulting .bam files.


Make sure you are on an idev node, or submit as a job

This command will take quite a bit of time to complete. While it is running on an idev node, see if you can figure out what each of the options in the mpileup and bcftools commands is doing.

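Calling variants using samtools and bcftools
# move into the tutorial directory and create a working directory for this example
cd $SCRATCH/BDIB_Human_tutorial
mkdir samtools_example
cd samtools_example
# unload the currently loaded samtools module
module unload samtools
# pile up reads against the reference, pipe the result to bcftools, and write the calls as VCF
samtools mpileup -uf $SCRATCH/BDIB_Human_tutorial/raw_files/ref/hs37d5.fa $SCRATCH/BDIB_Human_tutorial/raw_files/NA12878.chrom20.ILLUMINA.bwa.CEU.exome.20111114.bam | bcftools view -vcg - > trios_tutorial.raw.vcf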
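
As one way to read the pipeline above, the sketch below spells out each bundled option separately and prints the resulting command rather than running it. The option descriptions follow the usage text of the legacy samtools 0.1.x releases, which is the syntax the tutorial command uses. The variable names (RAW, REF, BAM, and so on) are introduced here for illustration only and are not part of the tutorial.

```shell
#!/bin/bash
# Dry-run sketch: builds the tutorial's calling pipeline and prints it.
# Variable names are illustrative assumptions, not part of the tutorial.
RAW="${SCRATCH:-.}/BDIB_Human_tutorial/raw_files"
REF="$RAW/ref/hs37d5.fa"
BAM="$RAW/NA12878.chrom20.ILLUMINA.bwa.CEU.exome.20111114.bam"

# samtools mpileup (legacy 0.1.x syntax):
#   -u  write uncompressed BCF, which suits piping into bcftools
#   -f  faidx-indexed reference FASTA
PILEUP_CMD="samtools mpileup -u -f $REF $BAM"

# bcftools view (legacy 0.1.x syntax), reading BCF from stdin ("-"):
#   -v  output potential variant sites only
#   -c  call variants using Bayesian inference
#   -g  call per-sample genotypes at variant sites
CALL_CMD="bcftools view -v -c -g -"

# Same pipeline as the tutorial command, with the options unbundled.
FULL_CMD="$PILEUP_CMD | $CALL_CMD > trios_tutorial.raw.vcf"
echo "$FULL_CMD"
```

Note that in the samtools/bcftools 1.x releases the calling step moved to `bcftools call`, so this exact option string applies only to the legacy releases.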


After your single-sample variant calling job completes


Keep in mind that variant files only record variation that can be seen with the data provided. Wherever the sample sequence exactly matches the reference (i.e. is homozygous wild type relative to the reference), there will be no record. This looks exactly the same as having no data in those regions, which leads us to our next topic.
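
A toy example may make this concrete. The three records below are made up for illustration (they are not output from the tutorial command), and the helper name `site_status` is hypothetical. The sketch shows that a position with no record in the VCF cannot be classified as "matches the reference" or "no coverage" from the VCF alone.

```shell
#!/bin/bash
# Toy VCF: HYPOTHETICAL records, written to a temp file for illustration only.
TOY_VCF="$(mktemp)"
printf '##fileformat=VCFv4.1\n' > "$TOY_VCF"
printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tNA12878\n' >> "$TOY_VCF"
printf '20\t1000\t.\tA\tG\t50\t.\t.\tGT\t0/1\n' >> "$TOY_VCF"
printf '20\t2500\t.\tC\tT\t99\t.\t.\tGT\t1/1\n' >> "$TOY_VCF"
printf '20\t4000\t.\tG\tA\t30\t.\t.\tGT\t0/1\n' >> "$TOY_VCF"

# Count variant records (every line that is not a header line).
N_VARIANTS=$(grep -vc '^#' "$TOY_VCF")
echo "variant records: $N_VARIANTS"

# Report whether a chromosome/position pair has a record in the toy VCF.
site_status() {
    if awk -F'\t' -v c="$1" -v p="$2" \
        '!/^#/ && $1 == c && $2 == p { found = 1 } END { exit !found }' "$TOY_VCF"
    then
        echo "variant"
    else
        # No record: either the sample matches the reference here,
        # or there were no reads at all. The VCF cannot tell us which.
        echo "unknown (reference match or no data)"
    fi
}

STATUS_2500=$(site_status 20 2500)
STATUS_3000=$(site_status 20 3000)
echo "20:2500 -> $STATUS_2500"
echo "20:3000 -> $STATUS_3000"
```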

Multiple-sample variant calling with samtools

This is all fine, but if you're actually trying to study human (or other organism) genetics, you must distinguish homozygous wild type from a lack of data. This is done by providing many samples to the variant caller simultaneously, so that a site that is variant in any one sample is evaluated in every sample. This concept extends further to populations: calling variants across a large and diverse population provides a stronger Bayesian prior probability distribution, allowing more sensitive detection.
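
A minimal sketch of what "providing many samples simultaneously" can look like with the same legacy samtools syntax: mpileup accepts several BAM files in one invocation. Only the NA12878 file name comes from this tutorial. The two parental BAM names and the output file name below are hypothetical placeholders, and the script prints the command instead of running it.

```shell
#!/bin/bash
# Dry-run sketch of multi-sample calling; prints the command only.
RAW="${SCRATCH:-.}/BDIB_Human_tutorial/raw_files"
REF="$RAW/ref/hs37d5.fa"

# Child BAM: the file used in the single-sample example.
CHILD_BAM="$RAW/NA12878.chrom20.ILLUMINA.bwa.CEU.exome.20111114.bam"
# Parental BAMs: HYPOTHETICAL placeholder names, replace with real paths.
FATHER_BAM="$RAW/NA12891.placeholder.bam"
MOTHER_BAM="$RAW/NA12892.placeholder.bam"
# Output name is also a placeholder chosen for this sketch.
OUT_VCF="trio.multisample.raw.vcf"

# Listing all BAMs in one mpileup call lets the caller evaluate
# every sample at each candidate site.
MULTI_CMD="samtools mpileup -u -f $REF $CHILD_BAM $FATHER_BAM $MOTHER_BAM | bcftools view -vcg - > $OUT_VCF"
echo "$MULTI_CMD"
```

In the resulting multi-sample VCF, each record carries one genotype column per sample, which is what makes it possible to compare the child against both parents at the same site.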
