...
Because these tools draw in information from many disparate sources, they can be very difficult to install, configure, use, and maintain. For example, the vcf
files from the 1000 Genomes project are arranged in a deep ftp tree by date of data generation. Large genome centers spend significant resources managing these tools. Our objective
Pre-packaged programs
Annovar - one of the
...
most powerful yet simple to run variant annotators available
Annovar is a variant annotator. Given a vcf file from an unknown sample and a host of existing data about genes, other known SNPs, gene variants, etc., Annovar will place the discovered variants in context.
...
Code Block | ||
---|---|---|
| ||
launcher_creator.py -l annovar.sge -n annovar -t 00:30:00 -j commands
qsub annovar.sge |
We have ALREADY pre-computed these outputs (although Annovar will run pretty quickly on data from only chr20). While Annovar is running, you might want to have a look at the code in annovar_pipe.sh
and summarize_annovar.pl. Note that these run Annovar in "gene-based" mode.
Expand | ||||||||||
---|---|---|---|---|---|---|---|---|---|---|
| ||||||||||
(Note the ` characters are "backtick", not apostrophe) |
ANNOVAR output
Annovar does a ton of work in assessing variants for us (though if you were going for clinical interpretation, you still have a long way to go - compare this to RUNES or CarpeNovo). It provides all these output files:
Code Block | ||
---|---|---|
| ||
NA12878.chrom20.samtools.vcf.exome_summary.csv
NA12878.chrom20.samtools.vcf.exonic_variant_function
NA12878.chrom20.samtools.vcf.genome_summary.csv
NA12878.chrom20.samtools.vcf.hg19_ALL.sites.2010_11_dropped
NA12878.chrom20.samtools.vcf.hg19_ALL.sites.2010_11_filtered
NA12878.chrom20.samtools.vcf.hg19_avsift_dropped
NA12878.chrom20.samtools.vcf.hg19_avsift_filtered
NA12878.chrom20.samtools.vcf.hg19_esp5400_all_dropped
NA12878.chrom20.samtools.vcf.hg19_esp5400_all_filtered
NA12878.chrom20.samtools.vcf.hg19_genomicSuperDups
NA12878.chrom20.samtools.vcf.hg19_ljb_all_dropped
NA12878.chrom20.samtools.vcf.hg19_ljb_all_filtered
NA12878.chrom20.samtools.vcf.hg19_phastConsElements46way
NA12878.chrom20.samtools.vcf.hg19_snp132_dropped
NA12878.chrom20.samtools.vcf.hg19_snp132_filtered
NA12878.chrom20.samtools.vcf.log
NA12878.chrom20.samtools.vcf.variant_function |
I find the exome_summary.csv to be one of the most useful files because it brings together nearly all the useful information. Here are the fields in that file (see these docs for more information, or the Annovar filter descriptions page here):
Func | exonic, splicing, ncRNA, UTR5, UTR3, intronic, upstream, downstream, intergenic |
Gene | The common gene name |
ExonicFunc | frameshift insertion/deletion/block subst., stopgain, stoploss, nonframeshift ins/del/block subst., nonsynonymous SNV, synonymous SNV, or Unknown |
AAChange (in gene coordinates) | |
Conserved (i.e. SNP is in a conserved region) | based on the UCSC 46-way conservation model |
SegDup (snp is in a segmental dup. region) | |
ESP5400_ALL | Alternate Allele Frequency in 3510 NHLBI ESP European American Samples |
1000g2010nov_ALL | Alternative Allele Frequency in 1000 genomes pilot project 2012 Feb release (minor allele could be reference or alternative allele). |
dbSNP132 | The id# in dbSNP if it exists |
AVSIFT | The AVSIFT score of how deleterious the variant might be |
LJB_PhyloP | Conservation score provided by dbNSFP, re-scaled from the original PhyloP score. The new score ranges from 0 to 1, with larger scores signifying higher conservation. A recommended cutoff threshold is 0.95. If the score is > 0.95, the prediction is "conservative". If the score is < 0.95, the prediction is "non-conservative". |
LJB_PhyloP_Pred | |
LJB_SIFT | SIFT takes a query sequence and uses multiple alignment information to predict tolerated and deleterious substitutions for every position of the query sequence. Positions with normalized probabilities less than 0.05 are predicted to be deleterious, those greater than or equal to 0.05 are predicted to be tolerated. |
LJB_SIFT_Pred | |
LJB_PolyPhen2 | Functional prediction score for non-syn variants from PolyPhen2 provided by dbNSFP (higher score represents functionally more deleterious). A score greater than 0.85 corresponds to a prediction of "probably damaging". The prediction is "possibly damaging" if the score is between 0.15 and 0.85, and "benign" if the score is below 0.15. |
LJB_PolyPhen2_Pred | |
LJB_LRT | Functional prediction score for non-syn variants from LRT provided by dbNSFP (higher score represents functionally more deleterious). It ranges from 0 to 1. This score needs to be combined with other information to make a prediction. If a threshold has to be picked in some situation, 0.995 can be used as a starting point. |
LJB_LRT_Pred | |
LJB_MutationTaster | Functional prediction score for non-syn variants from Mutation Taster provided by dbNSFP (higher score represents functionally more deleterious). The score ranges from 0 to 1. Similar to LRT, the prediction does not depend entirely on the score alone. But if a threshold has to be picked, 0.5 is recommended as the starting point. |
LJB_MutationTaster_Pred | |
LJB_GERP++ | higher scores are more deleterious |
Chr | |
Start | |
End | |
Ref | Reference base |
Obs | Observed base-pair or variant |
SNP Quality value | |
filter information | |
(ALL the VCF info is here!!) | |
GT:PL:GQ for each file! |
Everything after the "LJB_GERP++" field in exome_summary came from the original VCF file, so this file REALLY contains everything you need to go on to functional analysis! This is one of the many reasons I like Annovar.
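As a rough illustration of working with these fields, here is a minimal Python sketch that reads a summary CSV and turns the LJB_PolyPhen2 score into the labels described in the table. It assumes the header row uses the field names listed above. The function names, the handling of scores that fall exactly on a cutoff, and the tiny sample data are made up for illustration and are not part of Annovar.

```python
import csv
import io


def polyphen2_label(score):
    """Map a PolyPhen2 score to the labels described in the field table.

    Assumption: scores exactly equal to 0.85 or 0.15 are treated as
    "possibly damaging"; the table does not say which side they belong to.
    """
    if score > 0.85:
        return "probably damaging"
    if score >= 0.15:
        return "possibly damaging"
    return "benign"


def label_rows(handle):
    """Yield (gene, score, label) for rows that carry a numeric PolyPhen2 score."""
    for row in csv.DictReader(handle):
        raw = (row.get("LJB_PolyPhen2") or "").strip()
        try:
            score = float(raw)
        except ValueError:
            continue  # empty or non-numeric: no score for this variant
        yield row["Gene"], score, polyphen2_label(score)


# Made-up rows with the same column names (NOT real NA12878 data).
SAMPLE = """Func,Gene,ExonicFunc,LJB_PolyPhen2
exonic,GENE_A,nonsynonymous SNV,0.97
exonic,GENE_B,nonsynonymous SNV,0.40
exonic,GENE_C,synonymous SNV,
exonic,GENE_D,nonsynonymous SNV,0.02
"""

labelled = list(label_rows(io.StringIO(SAMPLE)))
for gene, score, label in labelled:
    print(gene, score, label)
```

To try it on real output, you would replace `io.StringIO(SAMPLE)` with something like `open("NA12878.chrom20.samtools.vcf.exome_summary.csv", newline="")`.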
Scavenger hunts!
Find the gene with two frameshift deletions in NA12878
Expand | ||
---|---|---|
| ||
DEFB126 (just grep "frameshift" from the exome_summary file) |
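If you would rather count than eyeball grep output, the same hunt can be sketched in Python. This is only a sketch: it assumes the Gene and ExonicFunc column names from the field list above, and the function name and sample rows are hypothetical. It matches the ExonicFunc value exactly, because a plain search for "frameshift" also hits "nonframeshift" rows.

```python
import csv
import io
from collections import Counter


def genes_by_exonic_func(handle, wanted="frameshift deletion"):
    """Count, per gene, the rows whose ExonicFunc equals `wanted` exactly."""
    counts = Counter()
    for row in csv.DictReader(handle):
        if (row.get("ExonicFunc") or "").strip() == wanted:
            counts[row["Gene"]] += 1
    return counts


# Made-up rows (NOT real NA12878 data) to show the exact-match behaviour.
SAMPLE = """Func,Gene,ExonicFunc
exonic,GENE_A,frameshift deletion
exonic,GENE_A,frameshift deletion
exonic,GENE_B,frameshift deletion
exonic,GENE_C,nonframeshift deletion
exonic,GENE_D,frameshift insertion
"""

counts = genes_by_exonic_func(io.StringIO(SAMPLE))
twice = [gene for gene, n in counts.items() if n == 2]
print(twice)
```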
Test "genetic drift" vs. "functional selection" - e.g. is the distribution of variants different among non-coding regions, synonymous changes in coding regions, and non-synonymous changes in coding regions?
Expand | ||
---|---|---|
| ||
Compare the output of these three commands:
Do you notice a pattern? What's the right statistical test to determine whether non-synonymous mutations might be under different selective pressure than intergenic or synonymous mutations from this data? |
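One possible way to get comparable counts, sketched in Python rather than with the original commands: tally variants into non-coding, synonymous, and non-synonymous classes using the Func and ExonicFunc fields. The class names, the `variant_class` helper, and the sample rows are assumptions made for illustration. On real data you would probably point this at genome_summary.csv, since exome_summary.csv is unlikely to hold many non-coding variants.

```python
import csv
import io
from collections import Counter


def variant_class(row):
    """Assign a row to one of the coarse classes compared in this exercise."""
    # Func can hold several values joined by ';' (assumed, e.g. "exonic;splicing").
    funcs = (row.get("Func") or "").split(";")
    exonic_func = (row.get("ExonicFunc") or "").strip()
    if "exonic" not in funcs:
        return "non-coding"
    if exonic_func == "synonymous SNV":
        return "synonymous"
    if exonic_func == "nonsynonymous SNV":
        return "nonsynonymous"
    return "other coding"  # frameshifts, stopgain, etc.


def tally_classes(handle):
    """Count variants per class from an Annovar summary CSV."""
    return Counter(variant_class(row) for row in csv.DictReader(handle))


# Made-up rows (NOT real NA12878 data).
SAMPLE = """Func,Gene,ExonicFunc
intergenic,GENE_A,
intronic,GENE_B,
exonic,GENE_C,synonymous SNV
exonic,GENE_D,nonsynonymous SNV
exonic;splicing,GENE_E,nonsynonymous SNV
exonic,GENE_F,stopgain SNV
"""

tally = tally_classes(io.StringIO(SAMPLE))
for name, n in sorted(tally.items()):
    print(name, n)
```

The counts on their own do not settle the question; they are just the table you would feed into whichever statistical test you decide is appropriate.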
Other variant annotators:
- http://www.yandell-lab.org/software/vaast.html
- http://www.broadinstitute.org/gatk/gatkdocs/ VariantAnnotator annotations
- http://www.bioconductor.org/help/workflows/variants/
- http://vat.gersteinlab.org/
- http://code.google.com/p/mu2a/
...