
I find exome_summary.csv to be one of the most useful files because it brings together nearly all of the useful information in one place. Here are the fields in that file (see these docs for more information, or the Annovar filter descriptions page here):

 

Func: exonic, splicing, ncRNA, UTR5, UTR3, intronic, upstream, downstream, intergenic
Gene: The common gene name
ExonicFunc: frameshift insertion/deletion/block subst., stopgain, stoploss, nonframeshift ins/del/block subst., nonsynonymous SNV, synonymous SNV, or Unknown
AAChange (in gene coordinates)
Conserved: SNP is in a conserved region, based on the UCSC 46-way conservation model
SegDup: SNP is in a segmental duplication region
ESP5400_ALL: Alternate allele frequency in 3510 NHLBI ESP European American samples
1000g2010nov_ALL: Alternative allele frequency in 1000 genomes pilot project 2012 Feb release (minor allele could be reference or alternative allele)
dbSNP132: The ID in dbSNP, if it exists
AVSIFT: The AVSIFT score of how deleterious the variant might be
LJB_PhyloP: Conservation score provided by dbNSFP, re-scaled from the original PhyloP score. The new score ranges from 0 to 1, with larger scores signifying higher conservation. A recommended cutoff threshold is 0.95: if the score is > 0.95 the prediction is "conservative"; if the score is < 0.95 the prediction is "non-conservative".
LJB_PhyloP_Pred
LJB_SIFT: SIFT takes a query sequence and uses multiple alignment information to predict tolerated and deleterious substitutions for every position of the query sequence. Positions with normalized probabilities less than 0.05 are predicted to be deleterious; those greater than or equal to 0.05 are predicted to be tolerated.
LJB_SIFT_Pred
LJB_PolyPhen2: Functional prediction score for non-synonymous variants from PolyPhen2, provided by dbNSFP (a higher score is functionally more deleterious). A score greater than 0.85 corresponds to a prediction of "probably damaging". The prediction is "possibly damaging" if the score is between 0.15 and 0.85, and "benign" if the score is below 0.15.
LJB_PolyPhen2_Pred
LJB_LRT: Functional prediction score for non-synonymous variants from LRT, provided by dbNSFP (a higher score is functionally more deleterious). It ranges from 0 to 1. This score needs to be combined with other information to make a prediction. If a threshold has to be picked, 0.995 can be used as a starting point.
LJB_LRT_Pred
LJB_MutationTaster: Functional prediction score for non-synonymous variants from Mutation Taster, provided by dbNSFP (a higher score is functionally more deleterious). The score ranges from 0 to 1. As with LRT, the prediction does not depend on the score alone, but if a threshold has to be picked, 0.5 is the recommended starting point.
LJB_MutationTaster_Pred
LJB_GERP++: Higher scores are more deleterious
Chr
Start
End
Ref: Reference base
Obs: Observed base-pair or variant
SNP Quality value
filter information
  DP = raw read depth
  VDB = variant distance bias (might be a problem with RNA-seq calls)
  RPB = read position bias (since early/late bp in a read may be worse)
  AF1 = max-likelihood estimate of the first ALT allele frequency (assuming HWE)
  HWE = Chi^2 based HWE test P-value based on G3
  AC1 = max-likelihood estimate of the first ALT allele count (no HWE assumption)
  DP4 = number of high-quality ref-forward, ref-reverse, alt-forward and alt-reverse bases
  MQ = root-mean-square mapping quality of covering reads
  FQ = Phred probability of all samples being the same
  PV4 = P-values for strand bias, baseQ bias, mapQ bias and tail distance bias
GT:PL:GQ for each file! 
(ALL the VCF info is here!!) 
GT:PL:GQ for each file! 

 

Everything after the "LJB_GERP++" field in exome_summary came from the original VCF file, so this file REALLY contains everything you need to go on to functional analysis!  This is one of the many reasons I like Annovar.
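To make the score cutoffs in the field list concrete, here is a minimal sketch that turns an LJB_PolyPhen2 score into the three categories described above. The function name polyphen2_class is made up for illustration, and how scores exactly equal to 0.85 or 0.15 are handled is an assumption, since the description above does not say.

```shell
# Hypothetical helper (not part of Annovar): classify one PolyPhen2 score
# using the cutoffs quoted in the field list above.
# Assumption: 0.85 itself counts as "possibly damaging", 0.15 itself too.
polyphen2_class() {
  awk -v s="$1" 'BEGIN {
    if (s == "")        print "NA"                 # empty field: no score
    else if (s > 0.85)  print "probably damaging"
    else if (s >= 0.15) print "possibly damaging"
    else                print "benign"
  }'
}

polyphen2_class 0.92   # prints: probably damaging
```

The LJB_PolyPhen2_Pred column should already carry a prediction, so a helper like this is mostly useful for checking your reading of the cutoffs or for trying out a stricter threshold of your own.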

Scavenger hunts!

Find the gene with two frameshift deletions in NA12878

Answer:

DEFB126

(just grep "frameshift" from the exome_summary file)
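Note that a plain grep for "frameshift" also matches the "nonframeshift" rows, so you still have to pick out the gene by eye. A slightly stricter count per gene could look like the sketch below. It assumes Gene is column 2 and ExonicFunc is column 3 (as in the field list above), that line 1 is a header, and that the gene field itself contains no commas; the function name is made up for illustration.

```shell
# Hypothetical helper: count "frameshift deletion" rows per gene in an
# Annovar summary CSV. Assumes Gene = column 2, ExonicFunc = column 3,
# a header on line 1, and no commas inside the gene field.
frameshift_deletions_per_gene() {
  awk -F, '
    NR > 1 {
      gene = $2; effect = $3
      gsub(/"/, "", gene); gsub(/"/, "", effect)   # drop CSV quotes if present
      if (effect == "frameshift deletion") count[gene]++
    }
    END { for (g in count) print count[g] "\t" g }
  ' "$1" | sort -k1,1nr -k2,2
}

# Usage (file name is a placeholder for your own exome_summary file):
#   frameshift_deletions_per_gene exome_summary.csv | head
```

Genes with the most frameshift deletions come out at the top, so the answer to the hunt should be the first gene with a count of 2.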

 

Test "genetic drift" vs. "functional selection" - e.g. is the distribution of variants different among non-coding regions, synonymous changes in coding regions, and non-synonymous changes in coding regions?

Answer:

Compare the output of these three commands:

grep intergenic NA12878.chrom20.samtools.vcf.genome_summary.csv | awk 'BEGIN {FS=","} {print $26"\t"$27}' | sort | uniq -c | sort -n -r | head -20
grep exonic NA12878.chrom20.samtools.vcf.genome_summary.csv | grep -w synonymous | awk 'BEGIN {FS=","} {print $25"\t"$26}' | sort | uniq -c | sort -n -r | head -20
grep exonic NA12878.chrom20.samtools.vcf.genome_summary.csv | grep -w nonsynonymous | awk 'BEGIN {FS=","} {print $25"\t"$26}' | sort | uniq -c | sort -n -r | head -20

Do you notice a pattern?

What's the right statistical test to determine whether non-synonymous mutations might be under different selective pressure than intergenic or synonymous mutations from this data?
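The three commands above read different column numbers for intergenic rows ($26 and $27) than for exonic rows ($25 and $26). This is presumably because some quoted fields, such as the intergenic Gene entry, contain commas of their own, which shifts a naive comma split. A minimal sketch of one way around that is to replace the commas inside double-quoted fields before splitting. The function name protect_quoted_commas is made up for illustration, and the sketch assumes quotes are never escaped inside a field.

```shell
# Hypothetical helper: replace commas inside double-quoted CSV fields
# with ';' so that a plain "awk -F," sees the same number of columns on
# every row. Reads the named files, or stdin if none are given.
# Assumption: no escaped quotes ("") inside a quoted field.
protect_quoted_commas() {
  awk '{
    out = ""; inq = 0
    n = length($0)
    for (i = 1; i <= n; i++) {
      c = substr($0, i, 1)
      if (c == "\"") inq = !inq             # entering or leaving a quoted field
      else if (c == "," && inq) c = ";"     # comma inside quotes: neutralize it
      out = out c
    }
    print out
  }' "$@"
}

# Usage (same file as in the commands above):
#   grep intergenic NA12878.chrom20.samtools.vcf.genome_summary.csv \
#     | protect_quoted_commas | awk -F, '{print $25"\t"$26}' | sort | uniq -c
```

With the field order listed earlier, Ref and Obs would then be columns 25 and 26 for every class of variant, so the same awk expression can be reused across all three comparisons.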

Other variant annotators:
