Content Comparison

...

Try to modify the previous code block to run in a new directory called BDIB_Annovar with from the .vcf files from the 3 individuals for both samtools and gatk that we looked at yesterday. Hint: you copied these files into your $SCRATCH/BDIB_Human_tutorial/raw_files directory yesterday.

Code Block

title	Create the submission script and submit

launcher_creator.py -l annovar.sge -n annovar -t 00:30:00 -j commands -A UT-2015-05-18
qsub annovar.sge

We have ALREADY pre-computed these outputs (although Annovar will run pretty quickly on data from only chr20).

Expand

title	In the mean time, you might want to have a look at the code to annovar_pipe.sh and summarize_annovar.pl. Note that these run Annovar in "gene-based" mode.

Again note the ` characters are "backtick", not apostrophes

Code Block

title	Print out the text of the bash script annovar_pipe.sh

more `which annovar_pipe.sh`

Code Block

title	Print out the text of the perl script summarize_annovar.pl

more `which summarize_annovar.pl`

ANNOVAR output

Annovar does a ton of work in assessing variants for us (though if you were going for clinical interpretation, you still have a long way to go - compare this to RUNES or CarpeNovo). It provides all these output files:

NA12878.chrom20.samtools.vcf.exome_summary.csv NA12878.chrom20.samtools.vcf.exonic_variant_function NA12878.chrom20.samtools.vcf.genome_summary.csv NA12878.chrom20.samtools.vcf.hg19_ALL.sites.2010_11_dropped NA12878.chrom20.samtools.vcf.hg19_ALL.sites.2010_11_filtered NA12878.chrom20.samtools.vcf.hg19_avsift_dropped NA12878.chrom20.samtools.vcf.hg19_avsift_filtered

Code Block

title	Example ANNOVAR output on the NA12878 vcf file

language	bash
title	Click here to check your work
collapse	true

cds
mkdir BDIB_Annovar
cd BDIB_Annovar
cp $SCRATCH/BDIB_Human_tutorial/raw_files/N*.vcf .
ls *.vcf | perl -n -e 'chomp; $_=~/(NA\d+).*(sam|GATK)/; print "annovar_pipe.sh $_ >$1.$2.log 2>&1\n";' > commands

Code Block

title	Create the submission script and submit

launcher_creator.py -l annovar.sge -n annovar -t 00:30:00 -j commands -A UT-2015-05-18
qsub annovar.sge

We have ALREADY pre-computed these outputs (although Annovar will run pretty quickly on data from only chr20).

Expand

title	In the mean time, you might want to have a look at the code to annovar_pipe.sh and summarize_annovar.pl. Note that these run Annovar in "gene-based" mode.

Again note the ` characters are "backtick", not apostrophes

Code Block

title	Print out the text of the bash script annovar_pipe.sh

more `which annovar_pipe.sh`

Code Block

title	Print out the text of the perl script summarize_annovar.pl

more `which summarize_annovar.pl`

ANNOVAR output

Annovar does a ton of work in assessing variants for us (though if you were going for clinical interpretation, you still have a long way to go - compare this to RUNES or CarpeNovo). It provides all these output files:

Code Block

title	Example ANNOVAR output on the NA12878 vcf file

NA12878.chrom20.samtools.vcf.hg19_esp5400_all_droppedexome_summary.csv
NA12878.chrom20.samtools.vcf.hg19exonic_esp5400variant_all_filteredfunction
NA12878.chrom20.samtools.vcf.hg19genome_genomicSuperDupssummary.csv
NA12878.chrom20.samtools.vcf.hg19_ljb_allALL.sites.2010_11_dropped
NA12878.chrom20.samtools.vcf.hg19_ljb_allALL.sites.2010_11_filtered
NA12878.chrom20.samtools.vcf.hg19_phastConsElements46wayavsift_dropped
NA12878.chrom20.samtools.vcf.hg19_snp132avsift_droppedfiltered
NA12878.chrom20.samtools.vcf.hg19_snp132_esp5400_all_dropped
NA12878.chrom20.samtools.vcf.hg19_esp5400_all_filtered
NA12878.chrom20.samtools.vcf.loghg19_genomicSuperDups
NA12878.chrom20.samtools.vcf.variant_function

I find the exome_summary.csv to be one of the most useful files because it brings together nearly all the useful information. Here are the fields in that file (see these docs for more information, or the Annovar filter descriptions page here):

...

Alternative Allele Frequency in 1000 genomes pilot project 2012 Feb release (minor allele could be reference or alternative allele).

...

Functional prediction score for non-syn variants from Polyphen2 provided by dbNSFP (higher score represents functionally more deleterious). A score greater than 0.85 corresponds to prediciton of "probably damaging". The prediciton is "possbily damaging" if score is between 0.85 and 0.15, and "benign" if score is below 0.15.

...

hg19_ljb_all_dropped
NA12878.chrom20.samtools.vcf.hg19_ljb_all_filtered
NA12878.chrom20.samtools.vcf.hg19_phastConsElements46way
NA12878.chrom20.samtools.vcf.hg19_snp132_dropped
NA12878.chrom20.samtools.vcf.hg19_snp132_filtered
NA12878.chrom20.samtools.vcf.log
NA12878.chrom20.samtools.vcf.variant_function

I find the exome_summary.csv to be one of the most useful files because it brings together nearly all the useful information. Here are the fields in that file (see these docs for more information, or the Annovar filter descriptions page here):

LRT_MutationTaster

Func	exonic, splicing, ncRNA, UTR5, UTR3, intronic, upstream, downstream, intergenic
Gene	The common gene name
ExonicFunc	frameshift insertion/deletion/block subst, stopgain, stoploss, nonframeshift ins/del/block stubst., nonsynonymous SNV, synonymous SNV, or Unknown
AAChange (in gene coordinates)
Conserved (i.e. SNP is in a conserved region)	based on the UCSC 46-way conservation model
SegDup (snp is in a segmental dup. region)
ESP5400_ALL	Alternate Allele Frequency in 3510 NHLBI ESP European American Samples
1000g2010nov_ALL	Alternative Allele Frequency in 1000 genomes pilot project 2012 Feb release (minor allele could be reference or alternative allele).
dbSNP132	The id# in dbSNP if it exists
AVSIFT	The AVSIFT score of how deleterious the variant might be
LJB_PhyloP	Conservation score provided by dbNSFP which is re-scaled from original phylop score. The new score ranges from 0-1 with larger scores signifying higher conservation. A recommended cutoff threshold is 0.95. If the score > 0.95, the prediction is "conservative". if the score <0.95, the prediction is "non-conservative".
LJB_PhyloP_Pred
LJB_SIFT	SIFT takes a query sequence and uses multiple alignment information to predict tolerated and deleterious substitutions for every position of the query sequence. Positions with normalized probabilities less than 0.05 are predicted to be deleterious, those greater than or equal to 0.05 are predicted to be tolerated.
LJB_SIFT_Pred
LJB_PolyPhen2	Functional prediction score for non-syn variants from Mutation Taster Polyphen2 provided by dbNSFP (higher score represents functionally more deleterious). The A score ranges from 0 to 1. Similar to LRT, the prediction is not entirely depending on the score alone. But if a threshold has to be picked, 0.5 is the recommended as the starting point.	greater than 0.85 corresponds to prediciton of "probably damaging". The prediciton is "possbily damaging" if score is between 0.85 and 0.15, and "benign" if score is below 0.15.
LJB_PolyPhen2_Pred
LJB_GERP++	higher scores are more deleterious
Chr
Start
End
Ref	Reference base
Obs	Observed base-pair or variant
SNP Quality value
filter information
(ALL the VCF info is here!!)
GT:PL:GQ for each file!

Everything after the "LJB_GERP++" field in exome_summary came from the original VCF file, so this file REALLY contains everything you need to go on to functional analysis! This is one of the many reasons I like Annovar.

Scavenger hunts!

Find the gene with two frameshift deletions in NA12878

Expand

title	Answer is...

DEFB126

(just grep "frameshift" from the exome_summary file)

Test "genetic drift" vs. "functional selection" - e.g. is the distribution of variants different among non-coding regions, synonymous changes in coding regions, and non-synonymous changes in coding regions?

...

title	Answer is...

Compare the output of these three commands:

...

LRT	Functional prediction score for non-syn variants from LRT provided by dbNSFP (higher score represents functionally more deleterious. It ranges from 0 to 1. This score needs to be combined with other information prediction. If a threshold has to be picked up under some situation, 0.995 can be used as starting point.
LJB_LRT_Pred
LRT_MutationTaster	Functional prediction score for non-syn variants from Mutation Taster provided by dbNSFP (higher score represents functionally more deleterious). The score ranges from 0 to 1. Similar to LRT, the prediction is not entirely depending on the score alone. But if a threshold has to be picked, 0.5 is the recommended as the starting point.
LRT_MutationTaster_Pred
LJB_GERP++	higher scores are more deleterious
Chr
Start
End
Ref	Reference base
Obs	Observed base-pair or variant
SNP Quality value
filter information
(ALL the VCF info is here!!)
GT:PL:GQ for each file!

Everything after the "LJB_GERP++" field in exome_summary came from the original VCF file, so this file REALLY contains everything you need to go on to functional analysis! This is one of the many reasons I like Annovar.

Scavenger hunts!

Expand

title	Find the gene with two frameshift deletions in NA12878. See what you can come up with as answer on your own and then click here for an answer and an expansion of how to generate more meaningful representations of that data

The final answer is "DEFB126"

Code Block

language	bash

grep "frameshift" NA12878.chrom20.GATK.vcf.exome_summary.csv  # this will print all the lines which contains the text "frameshift"
# From the output you can key into the first few columns having the information you are interested in: location-classification, gene, mutation type each separated by commas. This should lead you to think about adding the awk command to print only some columns.
 
grep "frameshift" NA12878.chrom20.GATK.vcf.exome_summary.csv | awk -F"," '{print $2"\t"$3}'  # the -F"," syntax forces it to split on commas
# you will likely notice this data is easier to visualize, and in this case you can probably see what gene is represented multiple times, but why stop there ... lets add the uniq -c command to the pipes to have linux count for us
 
grep "frameshift" NA12878.chrom20.GATK.vcf.exome_summary.csv | awk -F"," '{print $2"\t"$3}' | uniq -c 
# for the number of mutations we have this is sufficient, but for increased numbers of mutations where you may be interested in displaying them in a particular order. This can be done by adding the sort command

grep "frameshift" NA12878.chrom20.GATK.vcf.exome_summary.csv | awk -F"," '{print $2"\t"$3}' | uniq -c | sort -r  # the -r option on the sort command  sorts in reverse order

Expand

title	Compare the output of GATK and SAMtools for the NA12878 sample to see if you can find any differences

Code Block

language	bash

grep "frameshift" NA12878.chrom20.GATK.vcf.exome_summary.csv | awk -F"," '{print $2"\t"$3}' | uniq -c | sort -r
grep "frameshift" NA12878.chrom20.samtools.vcf.exome_summary.csv | awk -F"," '{print $2"\t"$3}' | uniq -c | sort -r

GATK detects 2 frameshift insertions in DNMT3B, and 1 each in PRNP and LAMA5. SAMtools detects 1 frameshift deletion in each of NTSR1 and CTSA that GATK does not detect.

Expand

title	Test "genetic drift" vs. "functional selection" - e.g. is the distribution of variants different among non-coding regions, synonymous changes in coding regions, and non-synonymous changes in coding regions?

Compare the output of these three commands:

Code Block

grep intergenic NA12878.chrom20.samtools.vcf.genome_summary.csv | awk 'BEGIN {FS=","} {print $26"\t"$27}' | sort | uniq -c | sort -n -r | head -20
grep exonic NA12878.chrom20.samtools.vcf.genome_summary.csv | grep -w synonymous | awk 'BEGIN {FS=","} {print $25"\t"$26}' | sort | uniq -c | sort -n -r | head -20
grep exonic NA12878.chrom20.samtools.vcf.genome_summary.csv | grep -w nonsynonymous | awk 'BEGIN {FS=","} {print $25"\t"$26}' | sort | uniq -c | sort -n -r | head -20

Do you notice a pattern?

What's the right statistical test to determine whether non-synonymous mutations might be under different selective pressure than intergenic or synonymous mutations from this data?

Expand

title	Thinking back to our first example where we looked for a gene with 2 frameshift deletions, can you think of a way to look at all 3 patients' GATK analysis results at once, and look for the total number of mutations detected (sorted by gene) displayed in descending order?

The final form of the command that you had was:

grep "frameshift" NA12878.chrom20.GATK.vcf.exome_summary.csv | awk -F"," '{print $2"\t"$3}' | uniq -c | sort -r

Code Block

language	bash
title	Try to come up with a solution on your own before checking your answer here
collapse	true

cat *GATK.vcf.exome_summary.csv | grep "exonic" | awk -F"," '{print $2,"\t"$3}' | sort | uniq -c | sort -n -rc | headsort -20

Do you notice a pattern?

What's the right statistical test to determine whether non-synonymous mutations might be under different selective pressure than intergenic or synonymous mutations from this data?

nr

Some questions for thought:

Why might this type of analysis be useful?
What other greps would make this more useful?

Other variant annotators:

...

Version	Old Version 1	New Version Current
Changes made by	Deatherage, Daniel E	Deatherage, Daniel E
Saved on	May 28, 2015	May 29, 2015

Versions Compared

Key

ANNOVAR output

ANNOVAR output

Scavenger hunts!

Scavenger hunts!

Other variant annotators: