Genome Assembly (SPAdes) -- GVA2019

Overview

SPAdes is a de Bruijn graph assembler that has become the preferred assembler in numerous labs and workflows. In this tutorial we will use SPAdes to assemble an E. coli genome from simulated Illumina reads. Genome assembly is quite difficult (though if Oxford Nanopore lowers its error rate, assembly will likely get much easier and involve new tools). Genome assembly should only be used when you cannot find a reference genome that is close to your own, when you are engaged in metagenomic projects where you don't know what organisms may be present, or when you believe you may have novel sequence insertions into a genome of interest. Note that in this last case you would actually want to assemble only the reads that do not map to your reference genome (and their pairs, in the case of paired-end and mate-pair sequencing) rather than the full fastq files you get from the raw sequencing.

Learning Objectives

  • Run SPAdes to perform de novo assembly on fragment, paired-end, and mate-pair data.
  • Use contig_stats.pl to display assembly statistics.
  • Find proteins of interest in an assembly using BLAST.

Installing SPAdes

Unfortunately, SPAdes does not exist as a module for loading on TACC, nor is it available in the BioITeam materials. However, it is available through the SPAdes website as binaries, is well supported, and doesn't require complex dependencies, making it easy to install.

 If SPAdes is so common a tool, why doesn't the BioITeam install it for everyone?

In my opinion there are a few reasons:

  1. Generally speaking, while SPAdes is commonly used for assemblies, assemblies themselves are not very common: once you have an assembled genome, you use that genome for future analysis rather than redoing the assembly.
  2. Since it is easily installed, it doesn't save people much work to install it for them.

First, navigate to the SPAdes home page http://cab.spbu.ru/software/spades/ and download the Linux binary distribution, either directly to TACC using wget, or by first downloading it to your laptop and then transferring it to TACC using scp. While you could put the file anywhere on lonestar (and can easily move it around on lonestar with the mv command once it is there), I suggest downloading or transferring the file to a 'src' folder on $WORK.

Making a DIRectory named SRC in $WORK (the capital letters are your clues)
mkdir $WORK/src

Do one of the following (or both if you want practice moving files around):

  1. Try to use 'wget -h' before clicking below. When using wget it is often helpful to right click on a link and select 'copy link address' when the file you want is available through a download link.

    How to use wget to download directly to TACC
    # Note that idev nodes tend to download files from the internet much more slowly than the head node does. If you are already in an idev node, it is likely faster to log out of the idev node (with the 'logout' command), execute the wget command listed below on the head node, and then start a new idev node after the download is complete.
    cd $WORK/src
    wget http://cab.spbu.ru/files/release3.13.0/SPAdes-3.13.0-Linux.tar.gz
  2. Remember that scp has 2 parts after the command name, just like the cp command: 1. the location where the file currently is, and 2. the location you want to copy the file to. Most of the class has dealt with moving things from TACC to your computer, but in this case things will move in the opposite direction. Think about what needs to change about your scp command to accomplish this.

    How to use scp to transfer the downloaded file to TACC from your laptop (Mac)
    # Run this in a terminal window on your laptop, not on LS5
    scp ~/Downloads/SPAdes-3.13.0-Linux.tar.gz <taccuserID>@ls5.tacc.utexas.edu:<$WORK pwd>/src # Note that you need to replace <$WORK pwd> with the output of running 'cd $WORK; pwd' on TACC

Once the .tar.gz file has been placed in the $WORK/src folder using one of the above options, you need to extract the files.

Hint: eXtracting a .tar.gz file is the opposite of Creating one (hints are in the capital letters)
cd $WORK/src
tar -xvzf SPAdes-3.13.0-Linux.tar.gz
# from the help file:
  # x = eXtract
  # v = verbose
  # z = file is also gzipped
  # f = File (the next argument is the name of the archive file to read)
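
If the single-letter tar options are hard to keep straight, it can help to try them on a throwaway archive first. The sketch below is only an illustration: the demo_pkg folder and tool.sh file are made-up stand-ins, not part of SPAdes. It creates a small .tar.gz, lists its contents without extracting, and then extracts it.

```shell
# Throwaway demonstration of the tar flags; demo_pkg and tool.sh are
# hypothetical stand-ins and have nothing to do with the SPAdes download.
demo_dir=$(mktemp -d)
mkdir -p "$demo_dir/demo_pkg/bin"
echo 'echo hello' > "$demo_dir/demo_pkg/bin/tool.sh"

# c = Create, z = gzip, f = archive File name; -C changes directory first
tar -czf "$demo_dir/demo_pkg.tar.gz" -C "$demo_dir" demo_pkg

# t = lisT the contents without extracting anything
listing=$(tar -tzf "$demo_dir/demo_pkg.tar.gz")
echo "$listing"

# x = eXtract, here into a separate directory so nothing is overwritten
mkdir "$demo_dir/extracted"
tar -xzf "$demo_dir/demo_pkg.tar.gz" -C "$demo_dir/extracted"
ls "$demo_dir/extracted/demo_pkg/bin"
```

Running 'tar -tzf' on a downloaded archive before extracting it is a cheap way to see what top-level folder it will create.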

Now that the files have been extracted, you have a choice in how to use them: one option is to copy the binary files to a location that is already in your path (such as the $HOME/local/bin directory we set up for you in your .bashrc file), and the second option is to add the $WORK/src/SPAdes-3.13.0-Linux/bin folder to your path. This is a personal preference and I do not know how prevalent either choice is among researchers. My preference is to copy executables to known locations in the path rather than add a ton of different directories to my path, but others may feel differently. Below I present both options:

Doing both of the following may cause unintended effects in the future (particularly if you attempt to update the version of SPAdes you are using) and I do not recommend it.

Copy executables to somewhere already in your path (i.e. $HOME/local/bin)
cp $WORK/src/SPAdes-3.13.0-Linux/bin/* $HOME/local/bin  # Note that by specifying the full paths of the files and the destination, this command can be run from anywhere on TACC.
Suggested line to add to your .bashrc file to directly access executables
export PATH=$WORK/src/SPAdes-3.13.0-Linux/bin:$PATH
# This line must be added to the .bashrc file found in your $HOME directory, in section 2. I typically add all modifications one after the other so the most recent thing I have added is the last line in this section.

If you have modified your PATH variable, it is a good idea to log out of TACC and log back in before continuing.
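
To see how adding a directory to PATH makes its executables discoverable, the sketch below uses a temporary directory and a made-up executable called demo_tool (both hypothetical, standing in for the SPAdes bin folder and spades.py). It follows the same pattern as the export line above and then asks the shell where it would find the command.

```shell
# Stand-in for an extracted bin directory; demo_tool is a hypothetical
# executable used only for this illustration.
demo_bin=$(mktemp -d)
printf '#!/bin/sh\necho "demo_tool ran"\n' > "$demo_bin/demo_tool"
chmod +x "$demo_bin/demo_tool"

# Same pattern as the .bashrc line: new directory first, old PATH after it
export PATH="$demo_bin:$PATH"

# command -v prints the full path the shell would run for a given name
found=$(command -v demo_tool)
echo "$found"

# Show the first few PATH entries, one per line; earlier entries win
# when two directories contain a program with the same name
echo "$PATH" | tr ':' '\n' | head -n 3
```

After logging back in to TACC, running 'command -v spades.py' in the same way should print a path inside whichever location you chose; if it prints nothing, the PATH change or the copy did not take effect.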

Testing SPAdes installation

SPAdes comes with a self-test option to make sure that the program is correctly installed. While this is not true of most programs, it is always a good idea to run whatever test data a program makes available rather than jumping straight into your own data: knowing whether an error comes from the program or from your data makes troubleshooting very different.

SPAdes self test
mkdir $SCRATCH/GVA_SPAdes_tutorial
cd $SCRATCH/GVA_SPAdes_tutorial
spades.py --test

Assuming everything goes correctly, the last lines printed to the screen should be:

Correct SPAdes output
======= SPAdes pipeline finished.

========= TEST PASSED CORRECTLY.

SPAdes log can be found here: <$SCRATCH>/GVA_SPAdes_tutorial/spades_test/spades.log

Thank you for using SPAdes!

In the SPAdes output, the lines printed just before these final lines list the different output files and results from the assembly. They appear at the end of every SPAdes run and can be helpful for keeping track of where all your output ends up.

If the end of the SPAdes test gives different output, do not continue.
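
If you would rather not check the final lines by eye, one option is to save the screen output to a file and search it for the pass message. The sketch below is a minimal illustration under stated assumptions: the function name spades_test_passed and the file name spades_test_output.txt are made up here, and the sample file simply reproduces the expected lines shown above so the example can run without SPAdes.

```shell
# Returns success only if the given file contains the pass message.
# spades_test_passed is a hypothetical helper, not part of SPAdes.
spades_test_passed() {
    grep -q "TEST PASSED CORRECTLY" "$1"
}

# Sample file standing in for captured screen output. A real capture
# might come from something like:
#   spades.py --test 2>&1 | tee spades_test_output.txt
sample=$(mktemp)
printf '%s\n' "======= SPAdes pipeline finished." "" \
    "========= TEST PASSED CORRECTLY." > "$sample"

if spades_test_passed "$sample"; then
    result="passed"
else
    result="failed"
fi
echo "self test $result"
```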