Overview
Velvet SPAdes
is a De Bruijn graph assembler works fairly rapidly on short (microbial) genomes. In this tutorial we will use velvet to assemble an E. coli genome from simulated Illumina reads. Genome assembly is quite difficult (though as Oxford Nanopore comes online it will likely get much easier and involve new tools). Genome assembly should only be used when you can not find a reference genome that is close to your own, if you are engaged in metagenomic projects where you don't know what organisms may be present, and in situations where you believe you may have novel sequence insertions into a genome of interest (Note that in this case however you would actually want to grab reads that do not map to your reference genome (and their pair in the case of paired end and mate-pair sequencing) rather than performing these functions on the fastq files you get from the gsaf.
Learning Objectives
- Run velvet SPAdes to perform de novo assembly on fragment, paired-end, and mate-paired data.
- Use contig_stats.pl to display assembly statistics.
- Find proteins of interest in an assembly using Blast.
Table of Contents
Table of Contents |
---|
Data
Tutorial assumes that you are on an idev node. If you are not sure please ask for help.
Code Block | ||
---|---|---|
| ||
cds mkdir BDIBGVA_velvet_tutorial cp $BI/ngs_course/velvet/data/*/* BDIBGVA_velvet_tutorial cd BDIB_velvet_tutorial |
...
There are 4 sets of simulated reads:
...
Set 1 | Set 2 | Set 3 | Set 4 | |
---|---|---|---|---|
Read Size | 100 | 100 | 100 | 100 |
Paired/Single Reads | Single | Paired | Paired | Paired |
Gap Sizes | NA | 400 | 400, 3000 | 400, 3000, 1500 |
Coverage | 50 | 50 | 25 for each subset | 20 for each subset |
Number of Subsets | 1 | 1 | 2 | 3 |
Note that these fastq files are "interleaved", with each read pair together one-after-the-other in the file. The #/1 and #/2 in the read names indicate the pairs.
...
Often your read pairs will be "separate" with the corresponding paired reads at the same index in two different files (each with exactly the same number of reads).
Velvet Assembly
Now let's use Velvet to assemble the reads.
...
Tip | ||
---|---|---|
| ||
As we discussed in our piping tutorial, things separated by a ';' mark will execute 1 after the other. In the case of &&, the second command will only execute if the first command finishes correctly (exit status zero to get technical). This can be useful to consider when building a pipeline to limit progression from 1 program to another when something has failed, BUT only if you understand exit states of each program. |
...
Velvet Output
Explore each output directory that was created by Velvet.
...
Expand | ||||
---|---|---|---|---|
| ||||
|
More assembly statistics: contig_stats.pl
The output file stats.txt contains information about every contig in the assembly, but it isn't sorted and can be a bit overwhelming.
...
Transfer back several of the different outputs into their own direcotry and being comparing them to determine which library and set of parameters seemed to work best.
What do I do now?
Many choices:
- Get a better assembly: maybe add a different library size, or go into a detailed genome completion project (commonly called "finishing") using a sequence assembly editor like
consed
orgap4
orAMOS
. (Be careful though, the amount of data in NGS data sets can be very difficult for these programs to deal with, since many were designed for Sanger sequencing reads.) You may have a lot of PCR products to make to close gaps and/or to order and orient scaffolds.consed
in particular makes this pretty easy, but it may still consume a lot more time and money than the initial shotgun assembly. You can identify some misassemblies by mapping the original reads to the assembly and then viewing them in IGV to look for discordant mate pairs, for example. - Look for things: If you're just after a few homologs, an operon, etc. you're probably done. Most assemblers will be able to take 2x100 data and give you full gene sequences since these are non-repetitive and so assemble well, and obviously 2x600 miSEQ runs will do even better. You can turn the contigs.fa into a blast database (
formatdb
ormakeblastdb
depending on which version of blast you have) and start blasting away.
Further Reading
- Examples of using single-end, paired-end, and mate-paired data in Velvet at the GenomeFactory.
- Another velvet tutorial
...