Annotating Variants: Introduction

As we've already seen, determining the presence or absence of a variant from NGS data is not trivial. It is software dependent and has inherent trade-offs between sensitivity and specificity. Inevitably, the number of putative variants in a real data set is very large; for example the first samples from the 1000 Genomes project typically found 0.1% variants (3 million variants), approximately 10% of which had never been previously observed (300,000 novel variants per individual.) False positive discovery rates are also typically very high at this stage.

Auxiliary data is often used to reduce putative variants without compromising sensitivity. Examples of auxiliary data include other samples within a cohort, existing SNP databases, gene or other feature annotations, and sample-specific information such as pedigree:

By comparing genotypes across a set of samples and defining one as "reference" (or "wild type") enables other samples to be properly genotyped (i.e. 0/0 for hom. WT, 0/1 for het, 1/1 for hom. alt)
Existing SNP databases such as dbSNP or the vcf files from the 1000 genomes project may be used to reject "common" variants under the assertion that "common" means "non-disease causing".
Gene annotations allow for codon analysis to determine whether mutations are synonymous, non-synonymous, nonsense, mis-sense, or create early stop codons.
Pedigree information is particularly effective in mendelian autosomal recessive diseases; filtering for heterozygous mutations in parents and unaffected siblings which are homozygous in the proband usually yields a very small set of candidate variants.

Variant annotation tools perform the function of combining the raw putative variant calls with auxiliary data to add meaning ("annotation") to the variants. In many cases, the variant detection tool itself will add certain elements of annotation, such as a definition of the variant, a genotype call, a measure of likelihood, a haplotype score, and other measures of the raw data useful to reduce false positives. In other cases, the annotator will only require a vcf file combined with other auxiliary data.

Because these tools draw in information from may disparate sources, they can be very difficult to install, configure, use, and maintain. For example, the vcf files from the 1000 Genomes project are arranged in a deep ftp tree by date of data generation. Large genome centers spend significant resources managing these tools. Our objective

Pre-packaged programs

Annovar - one of the most powerful yet simple to run variant annotators available

Annovar is a variant annotator. Given a vcf file from an unknown sample and a host of existing data about genes, other known SNPs, gene variants, etc., Annovar will place the discovered variants in context.

Annovar comes pre-packaged with human auxiliary data which is updated by the authors on a regular basis. It is a well-constructed package in that there is one core program annotate_variation.pl which can perform a variety of different types of annotation AND download the reference databases required.

The authors have also included a wrapper script summarize_anovar.pl which runs a fairly comprehensive set of annotations automatically.

This next exercise will give you some idea of how Annovar works; we've taken the liberty of writing the bash script annovar_pipe.sh around the existing summarize_annovar.pl wrapper (a wrapper within a wrapper - a common trick) to even further simplify the process for this course.

Exercise:

First, look at the code for our annovar_pipe.sh command. Here is an easy one-liner to cat the contents of a script (note ` is a back-tick, not apostrophe):

Print annovar_pipe.sh

cat `which annovar_pipe.sh`

This script simply does a format conversion and then calls summarize_annovar.pl. Now let's run it on all the vcf files - you could simply edit the commands file and type in the 6 lines, or you can use this fancier command line that calls Perl to custom-create the 6 command lines needed and put them straight into commands:

Create the "commands" file to run annovar on six vcf files - use Perl to extract the sample name and mapper so the log files have meaningful, but shorter, names

ls $BI/ngs_course/human_variation/N*.vcf | \
  perl -n -e 'chomp; $_=~/(NA\d+).*(sam|GATK)/; print "annovar_pipe.sh $_ >$1.$2.log 2>&1\n";' \
  > commands

Create the submission script and submit

launcher_creator.py -l annovar.sge -n annovar -t 00:30:00 -j commands
qsub annovar.sge

We have ALREADY pre-computed these outputs (although Annovar will run pretty quickly on data from only chr20). You might want to have a look at the code to annovar_pipe.sh and summarize_annovar.pl. Note that these run Annovar in "gene-based" mode.

If you want a hint, here's a fast way to look at the code to a script

Print out the text of the bash script annovar_pipe.sh

more `which annovar_pipe.sh`

Print out the text of the perl script summarize_annovar.pl

more `which summarize_annovar.pl`

(Note the ` characters are "backtick", not apostrophe)

ANNOVAR output

Annovar does a ton of work in assessing variants for us (though if you were going for clinical interpretation, you still have a long way to go - compare this to RUNES or CarpeNovo). It provides all these output files:

Example ANNOVAR output on the NA12878 vcf file

NA12878.chrom20.samtools.vcf.exome_summary.csv
NA12878.chrom20.samtools.vcf.exonic_variant_function
NA12878.chrom20.samtools.vcf.genome_summary.csv
NA12878.chrom20.samtools.vcf.hg19_ALL.sites.2010_11_dropped
NA12878.chrom20.samtools.vcf.hg19_ALL.sites.2010_11_filtered
NA12878.chrom20.samtools.vcf.hg19_avsift_dropped
NA12878.chrom20.samtools.vcf.hg19_avsift_filtered
NA12878.chrom20.samtools.vcf.hg19_esp5400_all_dropped
NA12878.chrom20.samtools.vcf.hg19_esp5400_all_filtered
NA12878.chrom20.samtools.vcf.hg19_genomicSuperDups
NA12878.chrom20.samtools.vcf.hg19_ljb_all_dropped
NA12878.chrom20.samtools.vcf.hg19_ljb_all_filtered
NA12878.chrom20.samtools.vcf.hg19_phastConsElements46way
NA12878.chrom20.samtools.vcf.hg19_snp132_dropped
NA12878.chrom20.samtools.vcf.hg19_snp132_filtered
NA12878.chrom20.samtools.vcf.log
NA12878.chrom20.samtools.vcf.variant_function

I find the exome_summary.csv to be one of the most useful files because it brings together nearly all the useful information. Here are the fields in that file (see these docs for more information, or the Annovar filter descriptions page here):

Func	exonic, splicing, ncRNA, UTR5, UTR3, intronic, upstream, downstream, intergenic
Gene	The common gene name
ExonicFunc	frameshift insertion/deletion/block subst, stopgain, stoploss, nonframeshift ins/del/block stubst., nonsynonymous SNV, synonymous SNV, or Unknown
AAChange (in gene coordinates)
Conserved (i.e. SNP is in a conserved region)	based on the UCSC 46-way conservation model
SegDup (snp is in a segmental dup. region)
ESP5400_ALL	Alternate Allele Frequency in 3510 NHLBI ESP European American Samples
1000g2010nov_ALL	Alternative Allele Frequency in 1000 genomes pilot project 2012 Feb release (minor allele could be reference or alternative allele).
dbSNP132	The id# in dbSNP if it exists
AVSIFT	The AVSIFT score of how deleterious the variant might be
LJB_PhyloP	Conservation score provided by dbNSFP which is re-scaled from original phylop score. The new score ranges from 0-1 with larger scores signifying higher conservation. A recommended cutoff threshold is 0.95. If the score > 0.95, the prediction is "conservative". if the score <0.95, the prediction is "non-conservative".
LJB_PhyloP_Pred
LJB_SIFT	SIFT takes a query sequence and uses multiple alignment information to predict tolerated and deleterious substitutions for every position of the query sequence. Positions with normalized probabilities less than 0.05 are predicted to be deleterious, those greater than or equal to 0.05 are predicted to be tolerated.
LJB_SIFT_Pred
LJB_PolyPhen2	Functional prediction score for non-syn variants from Polyphen2 provided by dbNSFP (higher score represents functionally more deleterious). A score greater than 0.85 corresponds to prediciton of "probably damaging". The prediciton is "possbily damaging" if score is between 0.85 and 0.15, and "benign" if score is below 0.15.
LJB_PolyPhen2_Pred
LJB_LRT	Functional prediction score for non-syn variants from LRT provided by dbNSFP (higher score represents functionally more deleterious. It ranges from 0 to 1. This score needs to be combined with other information prediction. If a threshold has to be picked up under some situation, 0.995 can be used as starting point.
LJB_LRT_Pred
LRT_MutationTaster	Functional prediction score for non-syn variants from Mutation Taster provided by dbNSFP (higher score represents functionally more deleterious). The score ranges from 0 to 1. Similar to LRT, the prediction is not entirely depending on the score alone. But if a threshold has to be picked, 0.5 is the recommended as the starting point.
LRT_MutationTaster_Pred
LJB_GERP++	higher scores are more deleterious
Chr
Start
End
Ref	Reference base
Obs	Observed base-pair or variant
SNP Quality value
filter information
(ALL the VCF info is here!!)
GT:PL:GQ for each file!

Everything after the "LJB_GERP++" field in exome_summary came from the original VCF file, so this file REALLY contains everything you need to go on to functional analysis! This is one of the many reasons I like Annovar.

Scavenger hunts!

Find the gene with two frameshift deletions in NA12878

Answer is...

DEFB126

(just grep "frameshift" from the exome_summary file)

Test "genetic drift" vs. "functional selection" - e.g. is the distribution of variants different among non-coding regions, synonymous changes in coding regions, and non-synonymous changes in coding regions?

Answer is...

Compare the output of these three commands:

grep intergenic NA12878.chrom20.samtools.vcf.genome_summary.csv | awk 'BEGIN {FS=","} {print $26"\t"$27}' | sort | uniq -c | sort -n -r | head -20
grep exonic NA12878.chrom20.samtools.vcf.genome_summary.csv | grep -w synonymous | awk 'BEGIN {FS=","} {print $25"\t"$26}' | sort | uniq -c | sort -n -r | head -20
grep exonic NA12878.chrom20.samtools.vcf.genome_summary.csv | grep -w nonsynonymous | awk 'BEGIN {FS=","} {print $25"\t"$26}' | sort | uniq -c | sort -n -r | head -20

Do you notice a pattern?

What's the right statistical test to determine whether non-synonymous mutations might be under different selective pressure than intergenic or synonymous mutations from this data?

Other variant annotators:

Notes

Variants consist of single base base changes, insertions and deletions, and larger scale structural changes. "Larger scale" is usually defined relative to the capabilities of the technology; for example, a "small indel" usually means "detectable within a single sequence read". In 2009, sequence reads were about 50 bp but in 2011 they were 100 bp.

Bioinformatics Team (BioITeam) at the University of Texas

Annotating variants with annovar (GVA14)