...
Learning Objectives
Installing Prokka
Code Block |
---|
language | bash |
---|
title | Conda installation |
---|
|
conda create -n GVA-prokka -c conda-forge -c bioconda -c defaults prokka
conda activate GVA-prokka |
Code Block |
---|
language | bash |
---|
title | Check for correct installation and review command line options |
---|
|
prokka --version
prokka --listdb
prokka --help |
Note the somewhat unusual '--listdb' call. Because prokka works largely by comparing sequences against reference databases, knowing which references it has access to is as important as having the program itself working. In such situations the program and its associated databases may be updated independently.
Panel |
---|
borderColor | green |
---|
borderWidth | 2 |
---|
borderStyle | solid |
---|
title | expected output should be similar to |
---|
|
prokka 1.14.6
Looking for databases in: /work2/01821/ded/stampede2/miniconda3/envs/GVA-prokka/db
* Kingdoms: Archaea Bacteria Mitochondria Viruses
* Genera: Enterococcus Escherichia Staphylococcus
* HMMs: HAMAP
* CMs: Archaea Bacteria Viruses
The help command should give a list of options of a kind you are familiar with by now. |
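Because the available databases matter as much as the program itself, it can be handy to test for a particular database in a script. The following is a minimal sketch, not part of prokka: `has_genus_db` is a hypothetical helper that reads `--listdb` text on standard input and looks for a genus name on the "Genera" line. The sample text is copied from the expected output above.

```bash
# has_genus_db GENUS
# Hypothetical helper: reads `prokka --listdb` text on stdin and succeeds
# if GENUS is listed on the "* Genera:" line.
has_genus_db() {
  grep -F '* Genera:' | grep -qw -- "$1"
}

# Demonstration on the sample output shown in this tutorial.
sample='* Kingdoms: Archaea Bacteria Mitochondria Viruses
* Genera: Enterococcus Escherichia Staphylococcus'

if printf '%s\n' "$sample" | has_genus_db Escherichia; then
  echo "Escherichia: available"
fi
```

With a working installation the same check would presumably be run as `prokka --listdb 2>&1 | has_genus_db Escherichia`; the `2>&1` is an assumption that prokka may write this listing to standard error.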
Get Some Data
If you have already run the SPAdes tutorial for assembling full bacterial genomes from simulated reads, it is recommended that you use one or more of the resulting sets of assembled contigs.
Expand |
---|
title | In the SPAdes tutorial you used several different sets of data. Which set of results do you think you should use for annotation? |
---|
|
The contigs.fa file corresponding to the "400_1500_3000" data set gives the highest quality assembly given the larger insert sizes and higher overall coverage. |
Code Block |
---|
|
mkdir $SCRATCH/GVA_Prokka
cd $SCRATCH/GVA_Prokka
cp ../GVA_SPAdes_tutorial |
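Before annotating, it may be worth confirming that the file you copied is a non-empty FASTA file and seeing how many contigs it holds. The sketch below assumes the file is named contigs.fa, as in the SPAdes discussion above; `count_fasta_records` is a hypothetical helper name introduced only for illustration.

```bash
# count_fasta_records FILE
# Hypothetical helper: prints the number of FASTA header lines (lines
# starting with ">") in FILE, or fails if FILE is missing or empty.
count_fasta_records() {
  if [ ! -s "$1" ]; then
    echo "missing or empty file: $1" >&2
    return 1
  fi
  grep -c '^>' "$1"
}

# Only report when the file is actually present in the current directory.
if [ -e contigs.fa ]; then
  count_fasta_records contigs.fa
fi
```

A count of zero, or the error message, suggests the copy step did not bring over the file you intended.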
Running Prokka
Using the prokka --help command, what options seem particularly useful or important to you?
Expand |
---|
title | Options that stand out to me |
---|
|
Options important for controlling files and program speed.
Option | Purpose | Note |
---|---|---|
--outdir | location to store files | As mentioned in other tutorials, not all programs can create new directories. Generally any that offer option |
--prefix | base file name to use for new files | Prefix/base names always come at the front of new file names, typically followed by more detailed extension information. This is also typically a clue that multiple output files will be generated. |
--cpus | number of threads to use, controlling speed | Interestingly, a value of zero is allowed; the program then identifies how many processors it has access to and uses them all. |
Options important for determining what predictions will be made.
Option | Purpose | Note |
---|---|---|
--proteins | protein FASTA file or GenBank file to search against first | This is most useful when you know your strain is closely related to an existing strain. |
--evalue | similarity e-value cut-off; lower values are more stringent and produce fewer matches | This may be something you adjust iteratively, depending on whether you feel large regions of the genome remain unannotated (unlikely) or you have lots of small fragments of genes. Likely left at the default unless you experience these issues, which may also be solved by providing additional databases. |
--coverage | minimum coverage of the query protein required for a match to be annotated | Another control point for how many annotations you will receive. |
|
For our example, we will leave --proteins, --evalue and --coverage at their defaults, making our command rather simple.
Code Block |
---|
language | bash |
---|
title | Try to determine the command yourself before comparing against one reasonable solution |
---|
collapse | true |
---|
|
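# prokka creates the --outdir directory itself and refuses to run if it already exists (unless --force is given), so no mkdir is needed
prokka --outdir gene_annotations --prefix mygenome contigs.fa |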
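If you later want to experiment with the prediction options discussed above, one possible variant is sketched here. The output directory name `gene_annotations_strict`, the function name `annotate_strict`, and the specific `--evalue` and `--coverage` values are illustrative assumptions, not recommendations. Setting `DRY_RUN` prints the command instead of running it, so it can be checked first.

```bash
# annotate_strict
# Hypothetical wrapper around a prokka call with non-default thresholds.
# When DRY_RUN is set to a non-empty value the command line is printed
# rather than executed.
annotate_strict() {
  # --cpus 0 asks prokka to use all processors it can see.
  # The --evalue and --coverage values are illustrative only.
  ${DRY_RUN:+echo} prokka \
    --outdir gene_annotations_strict \
    --prefix mygenome \
    --cpus 0 \
    --evalue 1e-09 \
    --coverage 90 \
    contigs.fa
}

# Print the command without running it.
DRY_RUN=1 annotate_strict
```

Calling `annotate_strict` with `DRY_RUN` unset would run the annotation itself. Writing to a separate output directory keeps the default-settings results available for comparison.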
Evaluating output
Next steps and optional exercises