...
Code Block | ||||||
---|---|---|---|---|---|---|
| ||||||
# if you have already done the trios tutorial cp $SCRATCH/GVA_Human_trios/raw_files/N*.vcf . # if you have not done the trios tutorial cp $BI/ngs_course/human_variation/N*.vcf . | ||||||
Code Block | ||||||
|
Get access to annovar
Unfortunately we have finally found a program that conda won't install for us. In a related matter, if you look at the annovar page itself, accessing the newest version of annovar is actually behind a registration wall, which is uncommon though not without precedent. Here we will therefore work with an old version of annovar, though if you use it in your own work it is 100% suggested/encouraged/etc that you work with the newest version.
The final complication is while the BioITeam has all of the required things to run annovar, stampede2 compute nodes can't access them. This means we need to copy a database of annotations, and several scripts to our scratch directory in order to run the analysis.
Code Block | ||||
---|---|---|---|---|
| ||||
cp -r $BI/ref_genome/annovar/hg19_annovar_db .
|
This folder is very large, it is expected to take several minutes to copy.
Code Block | ||||
---|---|---|---|---|
| ||||
cp /corral-repl/utexas/BioITeam/bin/annotate_variation.pl . cp /corral-repl/utexas/BioITeam/bin/convert2annovar.pl . cp /corral-repl/utexas/BioITeam/bin/summarize_annovar.pl . cp /corral-repl/utexas/BioITeam/bin/annovarPipe.sh . |
...
Now let's run it on the .vcf files from the 3 individuals (NA12878, NA12891, and NA12892) from both the samtools and gatk output in the $BI/ngs_course/human_variation/ directory. (You may recognize these as the same individuals that we worked with on the Trios tutorial. Throughout the class we've been teaching you how to create a commands file using nano, but here we provide a more complex example of how you can generate a commands file. As you become more proficient with the command line, it is likely you will use various piping techniques to generate commands file. The following calls Perl to custom-create the 6 command lines needed and put them straight into a commands file
:
Code Block | ||
---|---|---|
| ||
ls *.vcf | \
perl -n -e 'chomp; $_=~/(NA\d+).*(sam|GATK)/; print "annovarPipe.sh $_ hg19_annovar_db >$1.$2.log 2>&1 \n";' > annovar_commands
|
Note how the print statement includes redirections for generating log files, and the entire output is redirected to a file named commands.
Expand | |||||||||
---|---|---|---|---|---|---|---|---|---|
| |||||||||
|
Executing the commands
Warning | ||
---|---|---|
| ||
For reasons discussed numerous times throughout the course already, please be sure you are on an idev done. It is unlikely that you are currently on an idev node as copying the files while on an idev node seems to be having problems as discussed. Remember the hostname command and showq -u can be used to check if you are on one of the login nodes or one of the compute nodes. If you need more information or help re-launching a new idev node, please see this tutorial. |
Code Block | ||
---|---|---|
| ||
chmod +x commands
./commands |
The easiest way to check if this is working is to use ls and see an explosion of new files. This will take quite a bit of time to complete running. You can check that things are running using the ps command, and looking at the new .log files which are created. As the commands take some time to run, pre-computed versions of these outputs are available so you can begin evaluating the results if you don't want to wait for them to finish. . Throughout the class we've been teaching you how to create a commands file using nano, but here we provide a more complex example of how you can generate a commands file using perl. As you become more proficient with the command line, it is likely you will use various piping techniques such as these to generate commands file. The following calls Perl to custom-create the 6 command lines needed and put them straight into a commands file
:
Code Block | ||
---|---|---|
| ||
ls *.vcf | \
perl -n -e 'chomp; $_=~/(NA\d+).*(sam|GATK)/; print "annovarPipe.sh $_ hg19_annovar_db >$1.$2.log 2>&1 \n";' > annovar_commands
|
Note how the print statement includes redirections for generating log files, and the entire output is redirected to a file named annovar_commands.
Expand | |||||||||
---|---|---|---|---|---|---|---|---|---|
| |||||||||
|
submitting the job
As you may have guessed when we started creating a commands file, this analysis is headed for the job queue system. As we have done elsewhere: copy the launcher file, and make relevant edits to the analysis we are attempting to perform.
Code Block | ||||
---|---|---|---|---|
| ||||
cp /corral-repl/utexas/BioITeam/gva_course/GVA2021.launcher.slurm annovar.slurm
nano annovar.slurm
sbatch annovar.slrum |
Note that the above block does not include how to make the edits, nor the saving and closing of the slurm file. The needed edits are:
Line number | As is | To be |
---|---|---|
16 | #SBATCH -J jobName | #SBATCH -J spades |
17 | #SBATCH -n 1 | #SBATCH -n 6 |
22 | ##SBATCH --mail-user=ADD | #SBATCH --mail-user=<YourEmailAddress> |
23 | ##SBATCH --mail-type=all | #SBATCH --mail-type=all |
27 | conda activate GVA2021 | |
31 | export LAUNCHER_JOB_FILE=commands | export LAUNCHER_JOB_FILE=annovar_commands |
The changes to lines 22 and 23 are optional but will give you an idea of what types of email you could expect from TACC if you choose to use these options. Just be sure to pay attention to these 2 lines starting with a single # symbol after editing them.
Line 27 for the first time does not include a conda activation command as we are not working within a conda environment. By the same token it would be acceptable to list any environment.
A 12 hour run is requested because while I have been able to verify the analysis is working with a subset of the data, I have not been able to get the full analysis to complete. I am assuming that it will complete in less than 12 hours and will again update this page when I have verified this.
Again use ctl-o and ctl-x to save the file and exit.
Accessing pre-computed results
...
The exome_summary.csv
is probably the most useful files because it brings together nearly all the useful information. Here are the fields in that file (see these docs for more information, or the Annovar filter descriptions page here):
Func | exonic, splicing, ncRNA, UTR5, UTR3, intronic, upstream, downstream, intergenic |
Gene | The common gene name |
ExonicFunc | frameshift insertion/deletion/block subst, stopgain, stoploss, nonframeshift ins/del/block stubst., nonsynonymous SNV, synonymous SNV, or Unknown |
AAChange (in gene coordinates) | |
Conserved (i.e. SNP is in a conserved region) | based on the UCSC 46-way conservation model |
SegDup (snp is in a segmental dup. region) | |
ESP5400_ALL | Alternate Allele Frequency in 3510 NHLBI ESP European American Samples |
1000g2010nov_ALL | Alternative Allele Frequency in 1000 genomes pilot project 2012 Feb release (minor allele could be reference or alternative allele). |
dbSNP132 | The id# in dbSNP if it exists |
AVSIFT | The AVSIFT score of how deleterious the variant might be |
LJB_PhyloP | Conservation score provided by dbNSFP which is re-scaled from original phylop score. The new score ranges from 0-1 with larger scores signifying higher conservation. A recommended cutoff threshold is 0.95. If the score > 0.95, the prediction is "conservative". if the score <0.95, the prediction is "non-conservative". |
LJB_PhyloP_Pred | |
LJB_SIFT | SIFT takes a query sequence and uses multiple alignment information to predict tolerated and deleterious substitutions for every position of the query sequence. Positions with normalized probabilities less than 0.05 are predicted to be deleterious, those greater than or equal to 0.05 are predicted to be tolerated. |
LJB_SIFT_Pred | |
LJB_PolyPhen2 | Functional prediction score for non-syn variants from Polyphen2 provided by dbNSFP (higher score represents functionally more deleterious). A score greater than 0.85 corresponds to prediciton of "probably damaging". The prediciton is "possbily damaging" if score is between 0.85 and 0.15, and "benign" if score is below 0.15. |
LJB_PolyPhen2_Pred | |
LJB_LRT | Functional prediction score for non-syn variants from LRT provided by dbNSFP (higher score represents functionally more deleterious. It ranges from 0 to 1. This score needs to be combined with other information prediction. If a threshold has to be picked up under some situation, 0.995 can be used as starting point. |
LJB_LRT_Pred | |
LRT_MutationTaster | Functional prediction score for non-syn variants from Mutation Taster provided by dbNSFP (higher score represents functionally more deleterious). The score ranges from 0 to 1. Similar to LRT, the prediction is not entirely depending on the score alone. But if a threshold has to be picked, 0.5 is the recommended as the starting point. |
LRT_MutationTaster_Pred | |
LJB_GERP++ | higher scores are more deleterious |
Chr | |
Start | |
End | |
Ref | Reference base |
Obs | Observed base-pair or variant |
SNP Quality value | |
filter information | |
(ALL the VCF info is here!!) | |
GT:PL:GQ for each file! |
Everything after the "LJB_GERP++" field in exome_summary came from the original VCF file, so this file REALLY contains everything you need to go on to functional analysis! This is one of the many reasons I like Annovar.
...
- http://www.yandell-lab.org/software/vaast.html
- http://www.broadinstitute.org/gatk/gatkdocs/ VariantAnnotator annotations
- http://www.bioconductor.org/help/workflows/variants/
- http://vat.gersteinlab.org/
- http://code.google.com/p/mu2a/