Introduction
The next step is to annotate your genome assembly. First you have to determine which parts of the assembly are actually genes. The program Glimmer3 does a good job at predicting genes in a prokaryotic assembly. These predictions can then be queried against NCBI's Conserved Domain Database and/or the nr database.
Installing Glimmer3
Glimmer3 has been installed manually in the BioITeam bin directory ($BI/bin) since it is not offered as a module on TACC.
Remember from day 2 that there are three general steps for installing a Linux tool.
1. Download and uncompress the Glimmer3 source code
login1$ cdh
login1$ wget http://www.cbcb.umd.edu/software/glimmer/glimmer302.tar.gz
login1$ tar xzf glimmer302.tar.gz
2. Compile the Glimmer3 programs and copy them to ~/local/bin/
login1$ cd glimmer3.02/src/
login1$ make
login1$ cd ..
login1$ cp bin/* ~/local/bin/
3. Your $PATH variable should have been set up on day 2 to look in ~/local/bin/ for executables. If not, update your $HOME/.profile_user file with the following line.
export PATH="$HOME/local/bin:$PATH"
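After updating the file, open a new login shell (or source the file) and confirm that the shell can find the Glimmer executables. The check below assumes your login shell reads $HOME/.profile_user and that the glimmer3 binary was among the files copied to ~/local/bin/.

login1$ source $HOME/.profile_user
login1$ which glimmer3

The which command should print a path under $HOME/local/bin if everything is set up correctly.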
Running Glimmer3
Running Glimmer3 is a two-step process. First, a probability model of coding sequences, called an interpolated context model or ICM, must be built. Once that has been built, the glimmer3 program itself is run to analyze the assembled genome and make gene predictions.
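For reference, here is a rough sketch of what those two steps look like when run by hand, based on the sample run in the Glimmer3 documentation. The option values and file names here are illustrative assumptions; the course scripts may use slightly different settings.

# 1. Build the ICM: find long, non-overlapping ORFs, extract them, and train the model
long-orfs -n -t 1.15 contigs.fa contigs.longorfs
extract -t contigs.fa contigs.longorfs > contigs.train
build-icm -r contigs.icm < contigs.train

# 2. Run glimmer3 with the trained ICM to predict genes in the assembly
glimmer3 -o50 -g110 -t30 contigs.fa contigs.icm contigs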
Fortunately, Glimmer3 comes with several C-shell scripts that automate the whole process. This tutorial will take advantage of those, but they require some minor editing first, as described below.
At the top of each script, the directory paths to the Glimmer executables and Awk scripts are specified (the lines beginning with set glimmerpath and set awkpath). You will need to change these entries to the directories where those files were installed: the Glimmer executables are in ~/local/bin/ and the awk scripts are in ~/glimmer3.02/scripts. For now, you only have to make these changes in the g3-from-scratch.csh script. After editing it, lines 28 and 29 should look something like the following.
set awkpath = $HOME/glimmer3.02/scripts
set glimmerpath = $HOME/local/bin
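A quick way to confirm the edit is to print those two lines of the script. The path below assumes you unpacked Glimmer3 in your home directory as shown above.

sed -n '28,29p' $HOME/glimmer3.02/scripts/g3-from-scratch.csh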
We will actually use another script (run_glimmer.sh) to preprocess the contigs.fa file (produced by velvet) and to call g3-from-scratch.csh.
cdw
mkdir glimmer_example
cd glimmer_example
cp /corral-repl/utexas/BioITeam/sphsmith/run_glimmer.sh .
run_glimmer.sh must also be edited so that it knows where to look for the g3-from-scratch.csh script. After editing, line 11 should look something like the following.
$HOME/glimmer3.02/scripts/g3-from-scratch.csh $infile.cat $infile.glimmer
The scripts are now ready to be used. We'll run Glimmer on a de novo assembly of the bacterium Acinetobacter baumannii. First copy the contigs.fa file that velvet produced, and then execute the run_glimmer.sh script. Several files will be created, but the one containing the predicted genes is called contigs.fa.glimmer.predict.genes.
cdw
mkdir glimmer_example
cd glimmer_example
cp $BI/ngs_course/velvet/real_set/contigs.fa .
run_glimmer.sh contigs.fa
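Once run_glimmer.sh finishes, it is worth taking a quick look at the output. Assuming contigs.fa.glimmer.predict.genes is a multi-FASTA file of the predicted gene sequences (as the name suggests), you can count and preview the predictions as follows.

# assumes the .genes file is multi-FASTA; adjust if your output differs
grep -c '>' contigs.fa.glimmer.predict.genes
head contigs.fa.glimmer.predict.genes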
Annotating predicted genes
The key problem from here is to assign function to these genes. Homology is your best friend - here are a few starting points.
We'll BLAST this file against the nr database, which is located in /corral-repl/utexas/BioITeam/blastdb. Create a commands file with the following line.
blastx -query contigs.fa.glimmer.predict.genes -out nr_result.txt -db /corral-repl/utexas/BioITeam/blastdb/nr -outfmt 6
To BLAST the predicted genes, you could load the blast module, then use launcher_creator.py and qsub. As a newer alternative, you could read about this nifty script written by Bioinformatics Consultant Benni Goetz and use it to run this process MUCH faster.
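As a sketch of the launcher route, the commands below could work. The launcher_creator.py options and the name of the generated submission file differ between versions, so treat them as assumptions and check launcher_creator.py -h (and the course notes) before submitting.

# write the blastx command from above into a commands file, one command per line
echo "blastx -query contigs.fa.glimmer.predict.genes -out nr_result.txt -db /corral-repl/utexas/BioITeam/blastdb/nr -outfmt 6" > commands

# create and submit the launcher job (option names and the .sge file name are assumptions)
module load blast
launcher_creator.py -n blast_nr -j commands -t 04:00:00 -q normal
qsub blast_nr.sge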
Upon completion, the blast results can be converted to GFF format and viewed in IGV. Instead of waiting for blastx to finish, you can copy our partial search results.
cp $BI/ngs_course/nr_result.txt .
bl2gff.pl nr_result.txt > nr_result.gff
Note that we've written and provided the bl2gff.pl script in $BI/bin, which converts blast output into a gff file.
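If you are curious what such a conversion involves, here is an illustrative awk sketch (not the actual bl2gff.pl) that maps the tab-separated columns of -outfmt 6 output onto the nine GFF columns. The source, feature type, and attribute fields chosen here are arbitrary placeholders.

# illustrative only -- not the provided bl2gff.pl
# -outfmt 6 columns: qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore
awk 'BEGIN { OFS = "\t" }
{
    start  = ($7 < $8) ? $7 : $8     # GFF requires start <= end
    end    = ($7 < $8) ? $8 : $7
    strand = ($7 < $8) ? "+" : "-"
    print $1, "blastx", "match", start, end, $12, strand, ".", "Target=" $2 ";evalue=" $11
}' nr_result.txt > nr_result_sketch.gff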
To take advantage of TACC, you can also make your BLAST query run on multiple nodes. split_blast will take your BLAST command, split the data, run BLAST on the splits in parallel, and then combine the outputs. This can speed up your BLAST queries by orders of magnitude.
Another path to assign gene function is the Conserved Domain Database. This database abstracts not just "reference proteins" but "reference domains", clustered into families.
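If you want to search the CDD locally rather than through the NCBI web interface, the BLAST+ suite includes RPS-BLAST programs for searching profile databases. The sketch below uses rpstblastn (the translated-nucleotide analog of blastx) and assumes you have downloaded a CDD profile database; the database path is a placeholder.

# search the predicted (nucleotide) genes against a local copy of CDD;
# /path/to/cdd/Cdd is a placeholder for wherever the CDD profiles were unpacked
rpstblastn -query contigs.fa.glimmer.predict.genes -db /path/to/cdd/Cdd -evalue 1e-5 -outfmt 6 -out cdd_result.txt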
You might also be interested in assigning genes to Gene Ontology categories; you can work hard at this, or take a lot of shortcuts. Here is a paper illustrating a method. You'll note that the Gene Ontology officially supports many different "lines of evidence" for assigning GO terms.
Other pipelines for automated annotation
The NCBI Prokaryotic Genomes Automatic Annotation Pipeline (PGAAP) streamlines the whole annotation process for you. The pipeline is currently under development, but a standalone package is available here.
Evaluating Assemblies and Annotations
Two tools exist for evaluating an assembly and annotation. These are designed for eukaryotic organisms, however.
PASA (Program to Assemble Spliced Alignments) is a eukaryotic genome annotation tool that exploits spliced alignments of expressed transcript sequences to automatically model gene structures and to keep gene structure annotations consistent with the most recently available experimental sequence data.
CEGMA (Core Eukaryotic Genes Mapping Approach) is a tool for building a highly reliable set of gene annotations in the absence of experimental data. It defines a set of 458 core proteins that are present in a wide range of taxa. Due to the high conservation of these proteins, sequence alignment methods can reliably identify their exon-intron structures in genomic sequences. The resulting dataset can be used to train a gene finder or to assess the completeness of the genome or annotations.
Annotating Other Types of Features
There are many other databases and types of tailored search programs that are better for predicting other types of features. Here's a sampling:
- Rfam - General database for finding noncoding RNAs.
- tRNAScan-SE - specific, fast, and accurate searches for tRNA genes and pseudogenes (see the example after this list).
- miRBase - database of microRNA (miRNA) sequences and annotations.
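As an example of running one of these specialized tools, a bacterial tRNA scan might look like the following. This assumes tRNAscan-SE is installed and on your PATH, and that the -B (bacterial search mode) and -o (output file) options behave as in the versions we have used.

# scan the assembly for tRNA genes using the bacterial search mode
tRNAscan-SE -B -o trna_result.txt contigs.fa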