Transcriptome assembly & annotation

It is not uncommon to perform de novo transcriptome assembly before even sequencing an organism's genome. For polyploid organisms (those with more than two copies of each chromosome), it is vastly simpler than whole-genome de novo sequencing and assembly, and it often yields the most useful information for the least money. The Matz Lab at UT has been highly successful in this arena.

Most published transcriptome assemblies are based on Roche/454 sequence data, but the current generation of Illumina 2x100 reads is capable of providing excellent transcriptome assemblies.

What's the big deal?

Transcriptome assembly is quite distinct from whole genome assembly for three reasons:
a) Coverage is far from uniform, even for normalized cDNA libraries
b) "Contigs" are expected to be short (relative to the whole genome) and numerous
c) Ambiguities such as paralogs and splice variants have to be resolved
Most assemblers handle a) and b); it's not clear from the literature whether c) has really been addressed well yet.

To normalize or not to normalize...

The question of whether to normalize cDNA prior to sequencing remains open. Protocols for normalization work fairly well, but they focus on simply reducing the amount of the most abundant sequences and so still leave significant variation in abundance. It's probably best to evaluate normalization in the context of the research overall; for example, is it better to recover both draft transcripts and abundance estimates from two different tissues, timepoints, or developmental stages, or pool RNA from all of these first for the sake of assembly? The answer may depend on your goal - if you seek novel genes which you expect are highly represented in a particular condition, you may not want to normalize. Conversely, if you are strictly annotating a de novo genome sequence, pooling and normalization might be more useful.

How to assemble a transcriptome

Many assemblers exist. Some of the most popular are:

Velvet assembler and Oases post-processing module
TRANS-ABySS
Trinity (called trinityrnaseq on Lonestar)
SOAPdenovo-Trans

If you're just getting started, I would encourage you to use TACC's resources and try them all with many different parameters for each one in parallel. Then evaluate all the assemblies, decide whether to pick your favorite, try different conditions, or merge several of them.

Whatever you do, do it with scripts - I guarantee you'll be doing it again.
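One scriptable way to compare assemblies is to compute the same summary numbers for every output FASTA file. The sketch below is only an illustration: the function name fasta_stats is made up for this example, and N50 is a rough proxy that says nothing about whether the transcripts are correct. It assumes a standard FASTA file of contigs or transcripts, possibly with sequences wrapped over several lines.

```shell
# fasta_stats FILE: print contig count, total bases, and N50 for a FASTA file.
# (fasta_stats is a hypothetical helper name, not part of any assembler.)
fasta_stats() {
  # Pass 1: emit one length per sequence, summing wrapped sequence lines.
  awk '/^>/ { if (len) print len; len = 0; next }
       { len += length($0) }
       END { if (len) print len }' "$1" |
  # Longest first, so the running sum reaches half the total at the N50 contig.
  sort -rn |
  awk '{ lens[NR] = $1; total += $1 }
       END {
         half = total / 2
         for (i = 1; i <= NR; i++) {
           run += lens[i]
           if (run >= half) { n50 = lens[i]; break }
         }
         printf "contigs=%d total_bp=%d N50=%d\n", NR, total, n50
       }'
}

# Usage, for any assembly output file:
#   fasta_stats my_assembly.fa
```

Running the same function over every assembly makes it easy to line the results up side by side before deciding which parameter set to keep.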

For this exercise, we'll try the Velvet assembler with the Oases post-processing module.
The raw data for this exercise comes from this article:
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3016421/

Download command for the whole-transcriptome data from sweetpotato:

wget ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByRun/sra/SRR/SRR063/SRR063318/SRR063318.sra

The .sra file then has to be converted to FASTQ and reformatted:

module load sratoolkit
$TACC_SRATOOLKIT_DIR/fastq-dump SRR063318.sra
# Split each record after base 75 into two mates (tag_1, tag_2), written one after the other
awk 'NR%4==1 {tag=$1} NR%4==2 {seq=$1} NR%4==0 {print tag "_1\n" substr(seq,1,75) "\n+\n" substr($1,1,75) "\n" tag "_2\n" substr(seq,76) "\n+\n" substr($1,76)}' SRR063318.fastq > SRR063318.paired.fastq

This last command splits each record after base 75 into its two mates and writes them one after another, which is the interleaved layout the Velvet assembler expects for paired-end reads.
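Before assembling, it is worth checking that the reformatted file really is interleaved. The following is a minimal sketch, not part of the original exercise: the function name check_interleaved is made up, and it assumes four-line FASTQ records whose names end in _1 and _2, as produced by the awk command above.

```shell
# check_interleaved FILE: confirm that records alternate name_1 / name_2.
# (check_interleaved is a hypothetical helper name used for illustration.)
check_interleaved() {
  awk 'NR % 8 == 1 { name1 = $1; sub(/_1$/, "", name1) }   # header of mate 1
       NR % 8 == 5 { name2 = $1; sub(/_2$/, "", name2)     # header of mate 2
                     if (name1 != name2) bad++; else pairs++ }
       END { printf "pairs=%d mismatched=%d lines=%d\n", pairs, bad, NR
             # Non-zero exit if any pair mismatches or a record is truncated.
             exit (bad > 0 || NR % 8 != 0) }' "$1"
}

# Usage:
#   check_interleaved SRR063318.paired.fastq
```

The function exits with a non-zero status on a mismatched pair or a line count that is not a multiple of eight, so it can be used as a guard in a larger script.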

Velvet commands

OK - it's 11:50 pm and Lonestar's down... going to bed... hoping jobs will run fast tomorrow AM...
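For orientation, a Velvet + Oases run generally has three steps: velveth hashes the reads at a chosen k-mer length, velvetg builds the graph with read tracking turned on (Oases needs it), and oases extracts transcripts. The sketch below only prints the commands rather than running them, so they can be inspected or saved to a commands file first. Everything specific in it is an assumption made for illustration and not taken from this exercise: the k-mer values, the insert length of 200, the output directory names, and the function name print_velvet_oases_cmds.

```shell
# print_velvet_oases_cmds READS INS_LENGTH K...
# Prints one velveth/velvetg/oases command chain per k-mer value.
# (Hypothetical helper; the insert length and k values are assumptions.)
print_velvet_oases_cmds() {
  reads=$1
  ins=$2
  shift 2
  for k in "$@"; do
    out="oases_k${k}"
    # velveth: hash the interleaved paired-end FASTQ reads at k-mer length k
    # velvetg: build the graph, keeping read tracking on for Oases
    # oases:   extract transcripts from the Velvet graph
    echo "velveth ${out} ${k} -fastq -shortPaired ${reads} && velvetg ${out} -read_trkg yes && oases ${out} -ins_length ${ins}"
  done
}

# Print the commands for three odd k-mer values (assumed, not tuned)
print_velvet_oases_cmds SRR063318.paired.fastq 200 21 25 29
```

Each printed line is independent of the others, so the lines can be run one at a time or handed to a batch system to run in parallel. The insert length should be replaced with the real value for the library.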