...
Because these tools draw in information from many disparate sources, they can be very difficult to install, configure, use, and maintain. For example, the vcf
files from the 1000 Genomes project are arranged in a deep ftp tree by date of data generation. Large genome centers spend significant resources managing these tools. Our objective
Pre-packaged programs
Annovar - one of the
...
most powerful yet simple to run variant annotators available
Annovar is a variant annotator. Given a vcf file from an unknown sample and a host of existing data about genes, other known SNPs, gene variants, etc., Annovar will place the discovered variants in context.
...
```
launcher_creator.py -l annovar.sge -n annovar -t 00:30:00 -j commands
qsub annovar.sge
```
We have ALREADY pre-computed these outputs (although Annovar will run pretty quickly on data from only chr20). While Annovar is running, you might want to have a look at the code of annovar_pipe.sh and summarize_annovar.pl. Note that these run Annovar in "gene-based" mode.
(Note the ` characters are "backtick", not apostrophe)
ANNOVAR output
Annovar does a ton of work in assessing variants for us (though if you were going for clinical interpretation, you still have a long way to go - compare this to RUNES or CarpeNovo). It provides all these output files:
```
NA12878.chrom20.samtools.vcf.exome_summary.csv
NA12878.chrom20.samtools.vcf.exonic_variant_function
NA12878.chrom20.samtools.vcf.genome_summary.csv
NA12878.chrom20.samtools.vcf.hg19_ALL.sites.2010_11_dropped
NA12878.chrom20.samtools.vcf.hg19_ALL.sites.2010_11_filtered
NA12878.chrom20.samtools.vcf.hg19_avsift_dropped
NA12878.chrom20.samtools.vcf.hg19_avsift_filtered
NA12878.chrom20.samtools.vcf.hg19_esp5400_all_dropped
NA12878.chrom20.samtools.vcf.hg19_esp5400_all_filtered
NA12878.chrom20.samtools.vcf.hg19_genomicSuperDups
NA12878.chrom20.samtools.vcf.hg19_ljb_all_dropped
NA12878.chrom20.samtools.vcf.hg19_ljb_all_filtered
NA12878.chrom20.samtools.vcf.hg19_phastConsElements46way
NA12878.chrom20.samtools.vcf.hg19_snp132_dropped
NA12878.chrom20.samtools.vcf.hg19_snp132_filtered
NA12878.chrom20.samtools.vcf.log
NA12878.chrom20.samtools.vcf.variant_function
```
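The output names share one pattern: the input VCF name followed by a suffix for the annotation step. As a rough sketch (not part of the course scripts), the Python below groups such names by suffix. It assumes the usual Annovar filter convention, in which a `_dropped` file holds the variants that matched the database and a `_filtered` file holds the ones that did not. `PREFIX` and `classify_outputs` are hypothetical names introduced only for this illustration.

```python
from collections import defaultdict

# Hypothetical prefix: the input VCF name that precedes every output suffix.
PREFIX = "NA12878.chrom20.samtools.vcf."


def classify_outputs(filenames, prefix=PREFIX):
    """Group output file names by what their suffix suggests they contain."""
    groups = defaultdict(list)
    for name in filenames:
        if not name.startswith(prefix):
            continue  # not an output of this run
        suffix = name[len(prefix):]
        if suffix.endswith("_dropped"):
            # Assumed convention: variants that were found in the database.
            groups["in_database"].append(suffix[: -len("_dropped")])
        elif suffix.endswith("_filtered"):
            # Assumed convention: variants that were NOT found in the database.
            groups["not_in_database"].append(suffix[: -len("_filtered")])
        else:
            groups["other"].append(suffix)
    return dict(groups)


files = [
    "NA12878.chrom20.samtools.vcf.hg19_snp132_dropped",
    "NA12878.chrom20.samtools.vcf.hg19_snp132_filtered",
    "NA12878.chrom20.samtools.vcf.log",
]
print(classify_outputs(files))
```

Counting the lines in a `_dropped` and `_filtered` pair is then a quick way to see how many of your calls are already known to a given database.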
I find the exome_summary.csv to be one of the most useful files because it brings together nearly all the useful information. Here are the fields in that file:
- Func
- Gene
- ExonicFunc
- AAChange (in gene coordinates)
- Conserved (i.e. the SNP is in a conserved region)
- SegDup (the SNP is in a segmental duplication region)
- ESP5400_ALL
- 1000g2010nov_ALL
- dbSNP132
- AVSIFT
- LJB_PhyloP
- LJB_PhyloP_Pred
- LJB_SIFT
- LJB_SIFT_Pred
- LJB_PolyPhen2
- LJB_PolyPhen2_Pred
- LJB_LRT
- LJB_LRT_Pred
- LJB_MutationTaster
- LJB_MutationTaster_Pred
- LJB_GERP++
- Chr
- Start
- End
- Ref
- Obs
- SNP quality value
- Filter information
- VCF INFO tags:
  - DP = raw read depth
  - VDB = variant distance bias (might be a problem with RNA-seq calls)
  - RPB = read position bias (since early/late bp in a read may be worse)
  - AF1 = max-likelihood estimate of the first ALT allele frequency (assuming HWE)
  - HWE = Chi^2 based HWE test P-value based on G3
  - AC1 = max-likelihood estimate of the first ALT allele count (no HWE assumption)
  - DP4 = number of high-quality ref-forward, ref-reverse, alt-forward and alt-reverse bases
  - MQ = root-mean-square mapping quality of covering reads
  - FQ = Phred probability of all samples being the same
  - PV4 = P-values for strand bias, baseQ bias, mapQ bias and tail distance bias
- GT:PL:GQ for each file!
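To make these columns concrete, here is a small Python sketch (an illustration, not code from Annovar or samtools). It does two things: it picks out rows that look like novel nonsynonymous changes, and it splits an INFO string into its tags so that DP4 can be turned into an alternate-allele fraction. The assumptions are that ExonicFunc contains the text `nonsynonymous SNV` and that the dbSNP132 column is empty when a variant is not in dbSNP. The demo rows, the rs identifier, the INFO string, and the function names are all made up for this example.

```python
import csv
import io


def novel_nonsynonymous(rows):
    """Yield rows that are nonsynonymous SNVs with an empty dbSNP132 column."""
    for row in rows:
        if row["ExonicFunc"] == "nonsynonymous SNV" and not row["dbSNP132"].strip():
            yield row


def parse_info(info):
    """Split an INFO string such as 'DP=35;AF1=0.5' into a dict of strings."""
    tags = {}
    for item in info.split(";"):
        if not item:
            continue
        key, sep, value = item.partition("=")
        # Flag-style tags (no '=') are stored as True.
        tags[key] = value if sep else True
    return tags


def alt_fraction(tags):
    """Fraction of high-quality bases supporting ALT, computed from DP4.

    DP4 order: ref-forward, ref-reverse, alt-forward, alt-reverse.
    Returns None when DP4 is missing or all four counts are zero.
    """
    if "DP4" not in tags:
        return None
    ref_fwd, ref_rev, alt_fwd, alt_rev = (int(x) for x in tags["DP4"].split(","))
    total = ref_fwd + ref_rev + alt_fwd + alt_rev
    if total == 0:
        return None
    return (alt_fwd + alt_rev) / total


# Made-up rows using only a few of the columns listed above.
demo_csv = io.StringIO(
    "Gene,ExonicFunc,dbSNP132,Chr,Start\n"
    "GENE_A,nonsynonymous SNV,,chr20,100\n"
    "GENE_B,synonymous SNV,rs0000,chr20,200\n"
)
hits = list(novel_nonsynonymous(csv.DictReader(demo_csv)))
print([row["Gene"] for row in hits])

# Made-up INFO string for illustration only.
demo_tags = parse_info("DP=40;AF1=0.5;AC1=1;DP4=10,10,12,8;MQ=50;FQ=225")
print(demo_tags["DP"], alt_fraction(demo_tags))
```

For a real run you would pass an open `exome_summary.csv` handle to `csv.DictReader` instead of the in-memory demo, after checking that its header row actually uses these column names.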
Other variant annotators:
- http://www.yandell-lab.org/software/vaast.html
- http://www.broadinstitute.org/gatk/gatkdocs/ VariantAnnotator annotations
- http://www.bioconductor.org/help/workflows/variants/
- http://vat.gersteinlab.org/
- http://code.google.com/p/mu2a/
...