Transcriptome assembly & annotation

It is not uncommon to perform de novo transcriptome assembly before even sequencing an organism's genome. For organisms with more than two copies of each chromosome (i.e. polyploids), it is vastly simpler than whole-genome de novo sequencing and often yields the most useful information for the least money. The Matz Lab at UT has been highly successful in this arena.

Most published transcriptome assemblies are based on Roche/454 sequence data, but the current generation of Illumina 2x100 bp reads is capable of providing excellent transcriptome assemblies.

What's the big deal?

Transcriptome assembly is quite distinct from whole genome assembly for three reasons:
a) The coverage is absolutely not uniform, even for normalized cDNA libraries
b) "Contigs" are expected to be short (relative to the whole genome) and numerous
c) Ambiguities arise from paralogs and splice variants
Most assemblers have dealt with a) and b); it's not clear from the literature if c) has really been addressed well yet.

To normalize or not to normalize...

The question of whether to normalize cDNA prior to sequencing remains open. Protocols for normalization work fairly well, but they focus on simply reducing the amount of the most abundant sequences and so still leave significant variation in abundance. It's probably best to evaluate normalization in the context of the research overall; for example, is it better to recover both draft transcripts and abundance estimates from two different tissues, timepoints, or developmental stages, or to pool RNA from all of these first for the sake of assembly? The answer may depend on your goal: if you seek novel genes that you expect to be highly represented in a particular condition, you may not want to normalize. Conversely, if you are strictly annotating a de novo genome sequence, pooling and normalization might be more useful.

One way to assemble a transcriptome

Many assemblers exist - Velvet/Oases (used in this exercise) and Trinity are notable for their popularity.

If you're just getting started, I would encourage you to use TACC's resources and try them all, with many different parameter settings for each one, in parallel. Then evaluate all the assemblies and decide whether to pick your favorite, try different conditions, or merge several of them.

Whatever you do, do it with scripts - I guarantee you'll be doing it again
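For example, here is a minimal sketch of how a k-mer sweep might be scripted so that each single-k assembly becomes one line in a commands file for the launcher. The file names, k-mer values, and flags below are illustrative assumptions, not the ones used in the exercise files provided later.

# one complete velveth/velvetg/oases pipeline per k-mer value (hypothetical names)
for K in 21 25 29 31; do
  echo "velveth asm_k$K $K -fastq -shortPaired reads.paired.fastq && velvetg asm_k$K -read_trkg yes && oases asm_k$K" >> commands
done

Each line of the resulting commands file is a complete single-k pipeline, so a launcher-style job can run all of them in parallel on separate nodes.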

For this exercise, we'll try the Velvet assembler with the Oases post-processing module.
The raw data for this exercise is from here:
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3016421/
Downloaded via:

Download command for whole transcriptome data from sweetpotato
wget ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByRun/sra/SRR/SRR063/SRR063318/SRR063318.sra
You have to reformat the .sra file into FASTQ, of course:
module load sratoolkit
$TACC_SRATOOLKIT_DIR/fastq-dump SRR063318.sra
cat SRR063318.fastq | awk 'BEGIN {c=0} \
   {c++; if (c==1) {tag=$1} if (c==2) {seq=$1} if (c==3) {} \
   if (c==4) {print tag "_1\n" substr(seq,1,75) "\n+\n" substr($1,1,75) "\n" \
   tag "_2\n" substr(seq,76) "\n+\n" substr($1,76); c=0}}' > SRR063318.paired.fastq

This last command re-formats the paired-end data (which fastq-dump writes as a single concatenated record per spot) into the interleaved format the velvet assembler expects, with the two reads of each pair one after the other.
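A quick sanity check (my own suggestion, not part of the original protocol): every 4-line input record becomes two 4-line records, so the interleaved file should have exactly twice as many lines as the original.

wc -l SRR063318.fastq SRR063318.paired.fastq

(Depending on your sratoolkit version, fastq-dump may also be able to split each spot into separate _1/_2 files with --split-files, but the interleaved single-file format produced above is what velveth traditionally expects for paired reads.)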

NOTE: You should always do some degree of read filtering and/or trimming at this point. There is debate about how much: there is evidence that filtering on quality values is a bad idea, because quality values can be confounded with genuine sequence characteristics (e.g. strong secondary structure, GC%, etc.), so aggressive quality filtering can systematically remove real sequence and weaken your assembly. At the very least, everyone agrees you'd better remove adaptor sequences! The Oases manual has an extensive list of tools that can help with this filtering/trimming.
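As one illustration (not part of this exercise, and not necessarily one of the tools the Oases manual lists), adaptor trimming with cutadapt might look like the following; the adaptor sequence and minimum-length cutoff are placeholders you would replace with your own.

# trim a 3' adaptor (placeholder sequence) and discard reads shorter than 30 bp after trimming
cutadapt -a AGATCGGAAGAGC -m 30 -o SRR063318.trimmed.fastq SRR063318.fastq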

For the sake of educational simplicity, I'm going to skip this step for now.

Velvet commands
cp -r /corral-repl/utexas/BioITeam/ngs_course/transcriptome_assembly .
cd transcriptome_assembly
qsub launcher.sge
qsub -hold_jid velvetk merge.sge

All the velvet commands are embedded in the file named commands and should be run with the special launcher.sge file provided. This launcher.sge is special because it allocates each of the seven velveth/velvetg/oases command sets to its own node, seven nodes in all. The second qsub uses the special -hold_jid option, which tells the queue manager to hold the merge job until the job named "velvetk" has completed. If you examine merge.sge you'll see that it doesn't use the "launcher" mechanism but instead runs just one set of commands (the velvet/oases "merge" pass) directly in the submission script. This is equivalent to having only one line in the commands file and using the launcher script as we have been.
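For reference, the Oases documentation describes the merge pass as re-assembling the transcripts.fa output of each single-k run using velveth's -long input. A rough sketch is below; the directory names and merge k-mer value are illustrative, not necessarily those used in merge.sge.

# merge pass: re-assemble the transcripts from all single-k runs (illustrative values)
velveth merged 27 -long single_k*/transcripts.fa
velvetg merged -read_trkg yes -conserveLong yes
oases merged -merge yes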

Exploring the output

  1. Map the raw data back and examine it from both the perspective of the assembly (distribution of coverage, gaps, "variants" (which should agree with the ploidy of your organism), etc.) and from the perspective of the raw data (i.e. if only 10% of your raw data maps to your assembled transcripts, you should not be feeling particularly proud). A mapping sketch is shown after this list.
  2. Similarity searching: against proteins of related species, against the Conserved Domain Database (via rpsblast, available on Lonestar), against nr (which then opens up gene ontology matching), etc. Note that the nr and Cdd databases can be found on Lonestar at /corral-repl/utexas/BioITeam/blastdb.
  3. Boot up igv and poke around:
    1. "Import genome" and load the final transcripts.fa file
    2. Load your blast results by first converting them to gff files as explained on this page
    3. Load the raw data mapping results
  4. Do some differential expression by simply counting the number of reads hitting each transcript (see the counting sketch after this list). You'd be wise to make sure your mapper does something useful with reads that hit multiple transcripts. I like to run the analysis twice: once with all non-uniquely mapping reads placed everywhere possible, and once with all non-unique reads removed completely. If the DE results agree between these two, you have some comfort that multi-mapping reads haven't fooled you.
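Here is a rough sketch of points 1 and 4 above. The mapper choice (bowtie2), the samtools 1.x syntax, and the file names are my assumptions, not part of the exercise; substitute whatever mapper and modules you actually use.

# map the raw reads back to the assembled transcripts
bowtie2-build transcripts.fa transcripts
bowtie2 -x transcripts -U SRR063318.fastq -S remapped.sam
samtools sort -o remapped.bam remapped.sam
samtools index remapped.bam

# point 1: what fraction of the raw data does the assembly explain?
samtools flagstat remapped.bam

# point 4: crude per-transcript read counts (columns: name, length, mapped, unmapped)
samtools idxstats remapped.bam | sort -k3,3nr | head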