Trio (or familial) analysis has been exceptionally powerful for identifying rare childhood diseases. The most prominent publication in this area is this first example of whole exome sequencing saving a life. There are many other publications since and some review articles such as this one. Familial analysis is critical for rare, autosomal dominant diseases because, almost by definition, the mutations may be "private" to each individual so we can't look across big populations to find one single causative mutation. But within families, we can identify bad private mutations in single genes or pathways and then look across populations to find commonality at the gene or pathway level to explain a phenotype.
The initial steps in calling variants for diploid or multi-ploid organisms with NGS data are the same as what we've already seen:
Expectations of output are quite different however, which can add statistical power to uncovering variation in populations or organisms with more than two expected variants at the same location. If you happen to be working with a model organism with extensive external data (ESPECIALLY HUMAN), then there are even more sophisticated tools like the Broad Institute's GATK that can improve both sensitivity and specificity of your variant calls.
Many example datasets are available from the 1000 Genomes Project specifically for method evaluation and training. We'll explore a trio (mom, dad, child). Known as the CEU Trio from the 1000 genomes project their accession numbers are NA12892, NA12891, and NA12878 respectively. To make the exercise run FAR more quickly, we'll focus on data only from chromosome 20.
All the data we'll use is located here:
mkdir $SCRATCH/GVA_Human_trios cd $SCRATCH/GVA_Human_trios cp -r $BI/ngs_course/human_variation raw_files ls raw_files |
This directory contains the following:
ref with special referencesWe would normally use the BAM file from a previous mapping step to call variants in this raw data. However, for the purposes of this course we will use the actual BAM file provided by the 1000 Genomes Project (from which the .fastq file above was derived, leading to some oddities in it). As a bonus tutorial, you could map the data yourself and using what you learned in the bowtie2 tutorial and then use the resultant .bam files.
For now, the bam file we want to focus on is:
$SCRATCH/GVA_Human_trios/raw_files/NA12878.chrom20.ILLUMINA.bwa.CEU.exome.20111114.bam
With samtools, this is a two-step process:
samtools mpileup command transposes the mapped data in a sorted BAM file fully to genome-centric coordinates. It starts at the first base on the first chromosome for which there is coverage and prints out one line per base. Each line has information on every base observed in the raw data at that base position along with a lot of auxiliary information depending on which flags are set. It calculates the Bayseian prior probability given by the data, but does not estimate a real genotype.bcftools with a few options added uses the prior probability distribution and the data to calculate a genotype for the variants detected.It is unlikely that you are currently on an idev node as copying the files while on an idev node causes problems as discussed. Remember the hostname command and showq -u can be used to check if you are on one of the login nodes or one of the compute nodes. If you need more information or help re-launching a new idev node, please see this tutorial. DO NOT run these commands on the head node |
Recall that we installed samtools on our GVA2021 conda installation. Make sure you have activated your GVA2021 environment and you have access to samtools and bcftools version 1.9.
samtools --version output:
bcftools --version output:
|
The #1 most likely issue here will be that you see samtools version 1.7 rather than 1.9 and will suggest that somewhere in other modifications of the GVA2021 environment, we introduced a change to either samtools or opensssl. If you experience this error, let me know, I'm trying to track down if this is something that a tutorial specifically caused, or if its an error of installing some other tool into this environment mistakenly.
If you need to do this, make sure you are in the new GVA-samtools conda environment before using any of the samtools commands listed here, or switch to it when you get the error message about libcripto |
mkdir samtools_example cd samtools_example samtools mpileup -u -f $SCRATCH/GVA_Human_trios/raw_files/ref/hs37d5.fa $SCRATCH/GVA_Human_trios/raw_files/NA12878.chrom20.ILLUMINA.bwa.CEU.exome.20111114.bam | bcftools call -v -c - > trios_tutorial.raw.vcf |
The above command may take >30 minutes to run based on some student's experience and will not produce many lines of output leading students to worry their terminal has locked up. More than likely this is not the case, but you can try hitting the return key 1 or 2 times to see if your terminal adds a blank line to verify. As this command is taking so long it is recomended that you read ahead or switch to another tutorial in another terminal window while it runs, just be aware that you should not start 2 idev sessions in 2 different windows.
One potential issue with this type of approach is that vcf files only record variation that can be seen with the data provided. When all reads mapping to a given location exactly match the reference (i.e. is homozygous wildtype relative to the reference) there will be no data. Which looks the same as if you had no data in those regions; this leads us to our next topic.
Not being able to tell between no data and wildtype is not the end of the world for a single sample, but if you're actually trying to study human (or other organism) genetics, you must discriminate homozygous WT from a lack of data. This is done by providing many samples to the variant caller simultaneously. This concept extends further to populations; calling variants across a large and diverse population provides a stronger Bayesian prior probability distribution for more sensitive detection.
To instruct samtools to call variants across many samples, you must simply give it mapped data with each sample tagged separately. Samtools allows two methods to do this:
By providing separate bam files for each sample. Note that in the following code block, he trailing \ symbols tells the command line that you are not done entering the command and not to execute any commands until you hit return without a preceding \ mark.
|
The output file from this option is $BI/gva_course/GVA2021.all.samtools.vcf.samtools.vcf if you want to work with it without having to wait on the analysis to run personally.
By providing one or more bam files, each containing mapped reads from multiple samples tagged with unique samtools @RG tags.
This command comes with a sizable caveat: If you intend to use this option, you must make sure you tag your reads with the right RG tag; this can easily be done during the
|
Based on the discussion above, we are selecting the first solution and providing details of how this command should be run. As this command will generate very little output and take ~45 minutes to complete, you are once again reminded that the output file from this option is available $BI/gva_course/GVA2021.all.samtools.vcf if you want to work with it without having to wait on the analysis to run personally.
mkdir $SCRATCH/GVA_Human_trios/multi_sample
cd $SCRATCH/GVA_Human_trios
samtools mpileup -uf $SCRATCH/GVA_Human_trios/raw_files/ref/hs37d5.fa \
$SCRATCH/GVA_Human_trios/raw_files/NA12878.chrom20.ILLUMINA.bwa.CEU.exome.20111114.bam \
$SCRATCH/GVA_Human_trios/raw_files/NA12891.chrom20.ILLUMINA.bwa.CEU.exome.20111114.bam \
$SCRATCH/GVA_Human_trios/raw_files/NA12892.chrom20.ILLUMINA.bwa.CEU.exome.20111114.bam \
| bcftools call -v -c - > multi_sample/trios_tutorial.all.samtools.vcf
|
The observant students who start this command might notice that this time mpileup does give a message that it is working with 3 samples, and 3 input files (while all previous analysis have used 1 file and 1 sample).
If genetics works, you should be able to identify the child based strictly on the genotypes. Can you do it?
You're trying to find the genotypes in the trios_tutorial. |
Notice that if you are working with data provided rather than data generated you may have 2 different file names " |
cat trios_tutorial.all.samtools.vcf | tail -10000 | awk '{if ($6>500) {print $2"\t"$10"\t"$11"\t"$12}}' | grep "0/0" | sed s/':'/'\t'/g | awk '{print $2"\t"$4"\t"$6}' | tail -100 | sort | uniq -c | sort -n -r |
Here are the steps going into this command:
For more information about the individual commands and their options https://explainshell.com/ is a good jumping off point. |
35 0/1 0/1 0/0
20 0/0 0/1 0/0
18 0/1 0/0 0/1
14 0/1 1/1 0/0
6 0/0 0/1 0/1
5 1/1 1/1 0/0
1 1/1 0/1 0/0
1 0/1 0/0 0/0 |
Given this information can you make any determination about family structure for these 3 individuals? The first column is the number of occurrences generated by the uniq -c command in our large 1 liner, with the following 3 columns being the different individual samples. So consider that while some lines may not show a viable mendelian inheritance pattern, you should weight things according to how many times each scenario occurred as our filtering was fairly limited.
Overall this data is consistent with column 1 (NA12878) being the child. Lines marked with an * are inconsistent: 35 0/1 0/1 0/0
20 0/0 0/1 0/0This is, in fact, the correct assessment - NA12878 is the child. |
samtools mpileup commands used in this tutorial all produce the "samtools mpileup option `u` is functional, but deprecated. Please switch to using bcftools mpileup in future." warning we discussed in the SNV tutorial. As in that tutorial, we have ignored this as the samtools mpileup command appeared to function faster than the corresponding bcftools mpileup command. Can you repeat this exercise using bcftools instead? How do the results compare.