...
Because these tools draw in information from many disparate sources, they can be very difficult to install, configure, use, and maintain. For example, the vcf files from the 1000 Genomes project are arranged in a deep ftp tree by date of data generation. Large genome centers spend significant resources managing these tools.
Pre-packaged programs
Annovar - one of the most powerful yet simple to run variant annotators available
...
Expand | |||||||||
---|---|---|---|---|---|---|---|---|---|
| |||||||||
The which command is used to give the location of a program or script that is in your $PATH.
This script simply does a format conversion and then calls |
...
Code Block | ||
---|---|---|
| ||
ls $BI/ngs_course/human_variation/N*.vcf | \
perl -n -e 'chomp; $_=~/(NA\d+).*(sam|GATK)/; print "annovar_pipe.sh $_ >$1.$2.log 2>&1\n";' \
> commands
|
Try to modify the previous code block to run in a new directory called BDIB_Annovar using the .vcf files from the 3 individuals for both samtools and gatk that we looked at yesterday. Hint: you copied these files into your $SCRATCH/BDIB_Human_tutorial/raw_files directory yesterday.
Code Block | ||||||
---|---|---|---|---|---|---|
| ||||||
cds
mkdir BDIB_Annovar
cd BDIB_Annovar
cp $SCRATCH/BDIB_Human_tutorial/raw_files/N*.vcf .
ls *.vcf | perl -n -e 'chomp; $_=~/(NA\d+).*(sam|GATK)/; print "annovar_pipe.sh $_ >$1.$2.log 2>&1\n";' > commands
|
Code Block | ||
---|---|---|
| ||
launcher_creator.py -l annovar.sge -n annovar -t 00:30:00 -j commands -A UT-2015-05-18
qsub annovar.sge
|
We have ALREADY pre-computed these outputs (although Annovar will run pretty quickly on data from only chr20).
Expand | ||||||||||
---|---|---|---|---|---|---|---|---|---|---|
| ||||||||||
Again note the ` characters are "backtick", not apostrophes
|
Which begs the question what does
This script is written in Perl (a programming language that, while powerful, is not favored by most computational biologists, who typically prefer Python, R, or Bash shell scripts) as a way of standardizing input to call a series of external commands. Such wrappers (or wrappers within wrappers) make your life much easier. breseq itself has wrappers for bowtie2 and samtools, which you ran separately on other data. As you increase your understanding of scripts and the command line by looking at what others have done, you may begin to make your own wrappers and small scripts, but such wrappers are a great example of what the BioITeam and other community resources can provide you with. While we have told you that no one will ever care as much about your data as you do, quality-of-life issues regarding repetitive treatment of similar inputs and outputs may be easily solved by someone else. |
Now let's run it on the .vcf files from the 3 individuals (NA12878, NA12891, and NA12892) from both the samtools and gatk output in the $BI/ngs_course/human_variation/ directory. (You may recognize these as the same individuals that we worked with in the Trios tutorial.) Throughout the class we've been teaching you how to create a commands file using nano, but here we provide a more complex example of how you can generate a commands file. As you become more proficient with the command line, it is likely you will use various piping techniques to generate commands files. The following calls Perl to custom-create the 6 command lines needed and put them straight into a commands file:
Code Block | ||
---|---|---|
| ||
ls $BI/ngs_course/human_variation/N*.vcf | \
perl -n -e 'chomp; $_=~/(NA\d+).*(sam|GATK)/; print "annovar_pipe.sh $_ >$1.$2.log 2>&1\n";' > commands
|
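For readers less comfortable with Perl, the same command lines can be sketched in plain shell. This is only an illustration under stated assumptions, not the course's method: `make_annovar_commands` is a hypothetical helper, the GATK file name in the example is assumed, and unlike the Perl one-liner it skips any name that does not look like a sample .vcf file.

```shell
# Hypothetical helper: print one annovar_pipe.sh command line per input file.
# It approximates the Perl regex /(NA\d+).*(sam|GATK)/ with sed and case.
make_annovar_commands() {
  for vcf in "$@"; do
    name=$(basename "$vcf")
    # Sample ID: "NA" followed by digits at the start of the file name
    sample=$(printf '%s\n' "$name" | sed -n 's/^\(NA[0-9][0-9]*\).*/\1/p')
    # Caller tag: the same two alternatives the Perl regex looks for
    case $name in
      *sam*)  caller=sam ;;
      *GATK*) caller=GATK ;;
      *)      caller= ;;
    esac
    # Only emit a line when both pieces were found
    if [ -n "$sample" ] && [ -n "$caller" ]; then
      printf 'annovar_pipe.sh %s >%s.%s.log 2>&1\n' "$vcf" "$sample" "$caller"
    fi
  done
}

# Example call; the files do not need to exist to generate the text
make_annovar_commands NA12878.chrom20.samtools.vcf NA12891.chrom20.GATK.vcf
```

Redirecting the function's output to a file (for example with `> commands`) should give the same kind of commands file as the Perl version, with log names such as NA12878.sam.log.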
"commands" files are one way both to create a record of what you are doing and to execute multiple commands simultaneously. In a previous tutorial (the advanced breseq tutorial, or the 2nd breseq example from the first day) we showed you how to do this by adding an & as the last character of the line to force the command to execute in the background. Tomorrow we will go over the more common method of submitting "commands" files to the queue to run as jobs, rather than being spoiled by interacting with everything on idev nodes. In computational biology, this instructor has often found that once I learn to do something a particular way, it takes an incredible amount of effort to change how I do it, even when I know it would help to do so. So that you can learn from my making things harder on myself than need be, I STRONGLY urge you to consider not naming every list of commands you want to run "commands" but rather something descriptive; even "commands_DATE" would be helpful. As you start submitting your own jobs you will quickly be able to build bad habits; try to be aware of and avoid this one if possible.
Investigate the commands file to determine what the piping command actually did:
Code Block | ||
---|---|---|
| ||
cat commands
|
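Beyond reading the file by eye, a couple of quick checks can confirm that it has the expected shape. The sketch below is self-contained and uses a stand-in commands file with assumed file names; on the real file you would expect 6 lines.

```shell
# Work in a scratch directory so the sketch does not touch a real commands file
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Stand-in commands file with two of the expected lines (file names assumed)
cat > commands <<'EOF'
annovar_pipe.sh NA12878.chrom20.samtools.vcf >NA12878.sam.log 2>&1
annovar_pipe.sh NA12891.chrom20.samtools.vcf >NA12891.sam.log 2>&1
EOF

# Number of command lines in the file
n_lines=$(wc -l < commands | tr -d ' ')
# Lines that do NOT start with the wrapper name; 0 means every line looks right
# (grep exits non-zero when it selects nothing, hence the "|| true")
n_bad=$(grep -vc '^annovar_pipe\.sh ' commands || true)

echo "lines: $n_lines, unexpected: $n_bad"
```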
Try to modify the previous code block to run in a new directory called BDIB_Annovar using the .vcf files from the 3 individuals for both samtools and gatk that we looked at in the Trios tutorial. Hint: you copied these files into your $SCRATCH/BDIB_Human_tutorial/raw_files directory yesterday.
Code Block | ||||||
---|---|---|---|---|---|---|
| ||||||
cds
mkdir BDIB_Annovar
cd BDIB_Annovar
cp $SCRATCH/BDIB_Human_tutorial/raw_files/N*.vcf .
ls *.vcf | perl -n -e 'chomp; $_=~/(NA\d+).*(sam|GATK)/; print "annovar_pipe.sh $_ >$1.$2.log 2>&1 &\n";' > commands
|
Make sure you are on an idev node if you want to run this:
Code Block | ||
---|---|---|
| ||
chmod +x commands
./commands |
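Because every line in this commands file ends in &, the file itself can return before the background jobs finish. A minimal sketch of one way to handle that, combined with a more descriptive file name, is shown here. The stand-in echo commands, the log names, and the dated file name pattern are all assumptions for illustration; real lines would call annovar_pipe.sh.

```shell
# Work in a scratch directory so nothing real is overwritten
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Descriptive, dated file name (the pattern is only a suggestion)
cmdfile="annovar_commands_$(date +%Y-%m-%d)"

# Stand-in command lines; each ends in & so it runs in the background
cat > "$cmdfile" <<'EOF'
echo "first job" >first.log 2>&1 &
echo "second job" >second.log 2>&1 &
wait
EOF

# Run the file with sh; because of the final "wait", this returns only
# after both background jobs have finished
sh "$cmdfile"
cat first.log second.log
```

The final `wait` line is a shell builtin that blocks until all background jobs started by that shell have exited, so any step placed after it can safely read the log files.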
This will take quite a bit of time to complete running. As such, we have ALREADY pre-computed these outputs so you can begin evaluating the results.
ANNOVAR output
Annovar does a ton of work in assessing variants for us (though if you were going for clinical interpretation, you still have a long way to go - compare this to RUNES or CarpeNovo). It provides all these output files:
Code Block | ||
---|---|---|
| ||
NA12878.chrom20.samtools.vcf.exome_summary.csv
NA12878.chrom20.samtools.vcf.exonic_variant_function
NA12878.chrom20.samtools.vcf.genome_summary.csv
NA12878.chrom20.samtools.vcf.hg19_ALL.sites.2010_11_dropped
NA12878.chrom20.samtools.vcf.hg19_ALL.sites.2010_11_filtered
NA12878.chrom20.samtools.vcf.hg19_avsift_dropped
NA12878.chrom20.samtools.vcf.hg19_avsift_filtered
NA12878.chrom20.samtools.vcf.hg19_esp5400_all_dropped
NA12878.chrom20.samtools.vcf.hg19_esp5400_all_filtered
NA12878.chrom20.samtools.vcf.hg19_genomicSuperDups
NA12878.chrom20.samtools.vcf.hg19_ljb_all_dropped
NA12878.chrom20.samtools.vcf.hg19_ljb_all_filtered
NA12878.chrom20.samtools.vcf.hg19_phastConsElements46way
NA12878.chrom20.samtools.vcf.hg19_snp132_dropped
NA12878.chrom20.samtools.vcf.hg19_snp132_filtered
NA12878.chrom20.samtools.vcf.log
NA12878.chrom20.samtools.vcf.variant_function
|
I find the exome_summary.csv to be one of the most useful files because it brings together nearly all the useful information. Here are the fields in that file (see these docs for more information, or the Annovar filter descriptions page here):
...