Genome Assembly (SPAdes) -- GVA2020

Overview

SPAdes is a de Bruijn graph assembler that has become the preferred assembler in numerous labs and workflows. In this tutorial we will use SPAdes to assemble an E. coli genome from simulated Illumina reads. Genome assembly is quite difficult (though if Oxford Nanopore lowers its error rate, assembly will likely get much easier and involve new tools). Genome assembly should only be used when you cannot find a reference genome that is close to your own, when you are engaged in metagenomic projects where you don't know what organisms may be present, or when you believe you may have novel sequence insertions into a genome of interest. Note that in this last case you would actually want to grab the reads that do not map to your reference genome (and their pairs, in the case of paired-end and mate-pair sequencing) and assemble those, rather than assembling the fastq files you get directly from the raw sequencing.

A note about read preprocessing

While not explicitly covered here, adapter sequences left on reads can significantly complicate and harm an assembly. If using this tutorial on your own samples, make sure you are working with the best data possible: in this case, reads lacking adapters.

For those looking for a real challenge, go through the multiqc tutorial and the trimmomatic tutorial, and use the information provided here to compare assemblies of some of the same samples in both cases.


Learning Objectives

  • Run SPAdes to perform de novo assembly on fragment, paired-end, and mate-paired data.
  • Use contig_stats.pl to display assembly statistics.
  • Find proteins of interest in an assembly using BLAST.


Installing SPAdes

Unfortunately, SPAdes does not exist as a module for loading on TACC, nor is it available in the BioITeam materials. However, it is available through the SPAdes website as binaries, is well supported, and doesn't require complex dependencies, making it easy to install.

 If SPAdes is so common a tool, why doesn't the BioITeam install it for everyone?

In my opinion there are a few reasons:

  1. Generally speaking, while SPAdes is commonly used for assemblies, assemblies themselves are not very common as once you have an assembled genome, you use that genome for future analysis rather than redoing the assembly.
  2. Since it is easily installed, it doesn't save people much work to install it for them.
  3. As we have seen in a few of our other tutorials, things installed in the BioITeam are subject to upkeep by others and can break when modules or other programs are installed.

First, navigate to the SPAdes home page http://cab.spbu.ru/software/spades/ and download the Linux binary distribution directly to TACC using wget. While you could put the file anywhere on lonestar (and can easily move it around with the mv command once it is there), I suggest downloading the file to a 'src' folder on $WORK, as this is a good habit to get into.

Making a DIRectory named SRC in $WORK (the capital letters are your clues)
mkdir $WORK/src
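
If the src folder may already exist (for example, from an earlier tutorial), a plain mkdir will print an error. A minimal sketch of an idempotent alternative follows; the fallback to a temporary directory when $WORK is unset is an assumption added only so the snippet also runs off TACC.

```shell
# Assumption: fall back to a scratch directory when $WORK is not defined.
# On TACC, $WORK is already set for you and this line changes nothing.
WORK="${WORK:-$(mktemp -d)}"

# -p creates any missing parent directories and does not complain
# if the target directory already exists.
mkdir -p "$WORK/src"
mkdir -p "$WORK/src"   # running it a second time is a harmless no-op

# Confirm the directory is there.
ls -ld "$WORK/src"
```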

Note that idev nodes tend to download files from the internet much more slowly than the head node does. If you are already in an idev node, it is likely faster to log out of the idev node (with the 'logout' command), execute the wget command listed below on the head node, and then start a new idev node after the download is complete.


How to use wget to download directly to TACC
cd $WORK/src
wget http://cab.spbu.ru/files/release3.13.0/SPAdes-3.13.0-Linux.tar.gz
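
A slow or interrupted download can leave a truncated .tar.gz behind, so it can be worth checking the archive before extracting it. The sketch below is one possible way to do that, not part of the original tutorial: check_archive is a hypothetical helper name, and the demo builds its own tiny stand-in archive so it runs without network access.

```shell
# check_archive is a hypothetical helper, not part of SPAdes or TACC.
# It succeeds only if the gzip stream is intact AND tar can list every member.
check_archive() {
    gzip -t "$1" 2>/dev/null && tar -tzf "$1" >/dev/null 2>&1
}

# Self-contained demo: build a small stand-in archive in a scratch directory.
demo=$(mktemp -d)
mkdir "$demo/pkg"
echo "placeholder" > "$demo/pkg/README.txt"
tar -czf "$demo/good.tar.gz" -C "$demo" pkg

# Simulate an interrupted download by keeping only the first 20 bytes.
head -c 20 "$demo/good.tar.gz" > "$demo/partial.tar.gz"

if check_archive "$demo/good.tar.gz"; then good_status=ok; else good_status=bad; fi
if check_archive "$demo/partial.tar.gz"; then partial_status=ok; else partial_status=bad; fi
echo "good: $good_status, partial: $partial_status"
```

With the real download, the same function could be called from $WORK/src as `check_archive SPAdes-3.13.0-Linux.tar.gz`; if it fails, re-running wget is the simplest fix.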


Once the .tar.gz file has been placed in the $WORK/src folder, you need to extract the files.

Hint: eXtracting a .tar.gz file is the opposite of Creating one (hints are in the capital letters)
cd $WORK/src
tar -xvzf SPAdes-3.13.0-Linux.tar.gz
# from the help file:
  # x = Extract
  # v = verbose
  # z = file is also gzipped
  # f = file (the next argument is the name of the archive to read)
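
To see how creating and extracting mirror each other, here is a small round trip on a throwaway directory. It is only an illustration: demo_pkg and tool.sh are made-up names, and nothing here touches the SPAdes download.

```shell
# Build a tiny directory tree to archive (names are placeholders).
demo=$(mktemp -d)
mkdir -p "$demo/demo_pkg/bin"
echo "echo hello" > "$demo/demo_pkg/bin/tool.sh"

# c = Create, z = gzip the archive, f = archive file name follows.
# -C changes into $demo first so stored paths start at demo_pkg/.
tar -czf "$demo/demo_pkg.tar.gz" -C "$demo" demo_pkg

# t = lisT the contents without extracting anything.
listing=$(tar -tzf "$demo/demo_pkg.tar.gz")
echo "$listing"

# x = eXtract; here -C picks the directory to extract into.
mkdir "$demo/out"
tar -xzf "$demo/demo_pkg.tar.gz" -C "$demo/out"
ls "$demo/out/demo_pkg/bin"
```

Listing with -t before extracting is a cheap way to check what top-level folder an unfamiliar archive will create.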

Now that the files have been extracted you have a choice in how to use them: 1 option is t