Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.
Comment: Migrated to Confluence 5.3

...

Introduction

The next step is to annotate your genome assembly. First you have to determine which parts of the assembly are actually genes. The program Glimmer3 does a good job at predicting genes in a prokaryotic assembly. These predictions can then be queried against NCBI's Conserved Domain Database and/or the nr database. 

Installing Glimmer3

Glimmer3 has been installed manually in the BioITeam bin $BI/bin since it is not offered as a module on TACC.

Running Glimmer3

Running Glimmer3 is a two-step process. First, a probability model of coding sequences, called an interpolated context model or ICM, must be built. Once that has been built, the glimmer3 program itself is run to analyze the assembled genome and make gene predictions.

We'll run Glimmer on a de novo assembly of the bacterium Acinetobacter baumannii. First copy the contigs.fa file that velvet produced, and then execute the run_glimmer.sh script. This script preprocesses the contigs.fa file and calls the g3-from-scratch.csh script that was prepackaged with glimmer3. Several files will be created, but the one containing the predicted genes is called contigs.fa.glimmer.predict.genes.

Code Block
titleRunning glimmer
 cdw
 mkdir glimmer_example
 cd glimmer_example
 cp $BI/ngs_course/velvet/real_set/contigs.fa .
 run_glimmer.sh contigs.fa
{code}

h1.  Annotating predicted genes

The key problem from here is to assign function to these genes.  Homology is your best friend - here are a few starting points.

We'll BLAST this file against the nr database which is located in {{

 Annotating predicted genes

The key problem from here is to assign function to these genes. Homology is your best friend - here are a few starting points.

We'll BLAST this file against the nr database which is located in /corral-repl/utexas/BioITeam/blastdb

...

.

...

Create

...

a

...

commands

...

file

...

with

...

the

...

following

...

line.

...

 

Code Block
titleCommand line to run blastx of the predicted genes against the nr database
 blastx -query contigs.fa.glimmer.predict.genes -out nr_result.txt -db /corral-repl/utexas/BioITeam/blastdb/nr -outfmt 6
{code}

To

...

BLAST

...

the

...

predicted

...

genes,

...

you

...

could

...

load

...

the

...

blast

...

module,

...

use

...

launcher_creator.py

...

and

...

qsub

...

.

...

As

...

a

...

newer

...

alternative,

...

you

...

could

...

read

...

about

...

this

...

nifty

...

script

...

written

...

by

...

Bioinformatics

...

Consultant

...

Benni

...

Goetz

...

and

...

use

...

it

...

to

...

run

...

this

...

process

...

MUCH

...

faster.

...

} {
Expand
Need help? Click here to see how to run standard blastx...
Need help? Click here to see how to run standard blastx...
Code Block
:
title
=
Commands
to
blast
contigs
against
nr
}
 
echo "blastx -query contigs.fa.glimmer.predict.genes -out nr_result.txt -db /corral-repl/utexas/BioITeam/blastdb/nr -outfmt 6" > commands
 module load blast
 launcher_creator.py -n blastx -q normal -t 24:00:00 -j commands
 qsub launcher.sge
{code} {expand} {expand:or click here to see how to run Benni's greatly sped-up blastx...} {code:title=Commands to blast contigs against nr using Benni's script split_blast}
Expand
or click here to see how to run Benni's greatly sped-up blastx...
or click here to see how to run Benni's greatly sped-up blastx...
Code Block
titleCommands to blast contigs against nr using Benni's script split_blast
split_blast -N 6 -a 20130520NGS-FAC -t 1:00:00 blastx -outfmt 6 -db /corral-repl/utexas/BioITeam/blastdb/nr -max_target_seqs 1 -query contigs.fa.glimmer.predict.genes -out contigs.fa.glimmer.predict.genes.nr.blastx.out
{code} {expand} Upon completion the blast results can be converted to GFF format and be viewed in IGV. Instead of waiting for blastx to finish, you can copy our partial search results.  {anchor:bl2gff} {code}

Upon completion the blast results can be converted to GFF format and be viewed in IGV. Instead of waiting for blastx to finish, you can copy our partial search results. 

Anchor
bl2gff
bl2gff

Code Block
 cp $BI/ngs_course/nr_result.txt .
 bl2gff.pl nr_result.txt > nr_result.gff
{code}

Note

...

that

...

we've

...

written

...

and

...

provided

...

the

...

bl2gff.pl

...

script

...

in

...

$BI/bin

...

which

...

converts

...

blast

...

output

...

into

...

a

...

gff

...

file.

...

To

...

take

...

advantage

...

of

...

TACC,

...

you

...

can

...

also

...

make

...

your

...

BLAST

...

query

...

run

...

on

...

multiple

...

nodes.

...

split_blast

...

will

...

take

...

your

...

BLAST

...

command,

...

split

...

the

...

data,

...

run

...

BLAST

...

on

...

the

...

splits

...

in

...

parallel,

...

and

...

then

...

combine

...

the

...

outputs.

...

This

...

can

...

speed

...

up

...

your

...

BLAST

...

queries

...

by

...

orders

...

of

...

magnitude.

...

Another

...

path

...

to

...

assign

...

gene

...

function

...

is

...

the

...

Conserved

...

Domain

...

Database

...

.

...

This

...

database

...

abstracts

...

not

...

just

...

"reference

...

proteins"

...

but

...

"reference

...

domains",

...

clustered

...

into

...

families.

...

You

...

might

...

also

...

be

...

interested

...

in

...

assigning

...

genes

...

to

...

Gene

...

Ontology

...

categories;

...

you

...

can

...

work

...

hard

...

at

...

this,

...

or

...

take

...

a

...

lot

...

of

...

shortcuts.

...

Here

...

is

...

a

...

paper

...

illustrating

...

a

...

method

...

.

...

You'll

...

note

...

that

...

the

...

Gene

...

Ontology

...

officially

...

supports

...

many

...

different

...

"lines

...

of

...

evidence"

...

to

...

assignment

...

of

...

gene

...

ontologies.

...

Other

...

pipelines

...

for

...

automated annotation

The RAST webserver (registration required) provides on-demand annotation of genes in microbial or organellar genomes.

The NCBI Prokaryotic Genomes Automatic Annotation Pipeline (PGAAP) streamlines the whole annotation process for you. The pipeline is currently under development, but a standalone package is available here, however.  

Evaluating Assemblies and Annotations

Two tools exist for evaluating an assembly and annotation. These are designed for eukaryotic organisms, however.

PASA (Program to Assemble Spliced Alignments) is  a eukaryotic genome annotation tool that exploits spliced alignements of expressed transcript sequences to automatically model gene structures, and to maintain gene structure annotation consistent with the most recently available experimental sequence data. 

CEGMA (Core Eukaryotic Genes Mapping Approach) is tool for building a highly reliable set of gene annotations in the absence of experimental data. It defines a set of 458 core proteins that are present in a wide range of taxa. Due to the high conservation of these proteins, sequence alignement methods can reliably identify their exon-intron structures in genomic sequences. The resulting dataset can be used to train a gene finder or to assess the completeness of the genome or annotations.  

Annotating Other Types of Features

There are many other databases and types of tailored search programs that are better for predicting other types of features. Here's a sampling:

  • Rfam - General database for finding noncoding RNAs.
  • tRNAScan-SE - specific fast and accurate searches for tRNA genes and pseudogenes.
  • miRBase - miRNAs