Overview
SPAdes is a De Bruijn graph assembler which has become the preferred assembler in numerous labs and workflows. In this tutorial we will use SPAdes to assemble an E. coli genome from simulated Illumina reads. Genome assembly is quite difficult (though if Oxford Nanopore lowers its error rate assembly will likely get much easier and involve new tools). Genome assembly should only be used when you can not find a reference genome that is close to your own, if you are engaged in metagenomic projects where you don't know what organisms may be present, and in situations where you believe you may have novel sequence insertions into a genome of interest (Note that in this case however you would actually want to grab reads that do not map to your reference genome (and their pair in the case of paired end and mate-pair sequencing) rather than performing these functions on the fastq files you get from the raw sequencing.
Learning Objectives
- Run SPAdes to perform de novo assembly on fragment, paired-end, and mate-paired data.
- Use contig_stats.pl to display assembly statistics.
- Find proteins of interest in an assembly using Blast.
Table of Contents
Installing SPAdes
Unfortunately, SPAdes does not exist as a module for loading on TACC nor is it available in the BioITeam materials. As it is available through the SPAdes website as binaries, is well supported, and doesn't require complex dependancies making it easy to install.
First, navigate to the SPAdes home page http://cab.spbu.ru/software/spades/ and download the linux binary distribution either directly to TACC using wget, or first downloading it to your laptop then transferring it to to TACC using SCP. While you could put the file anywhere on lonestar (and can easily move it around on lonestar with the mv command once it is there), I suggest downloading or transferring the file to a 'src' folder on $WORK.
Do one of the following, (or both if you want practice moving files around):
Try to use 'wget -h' before clicking below. When using wget it is often helpful to right click on a link and select 'copy link address' when the file you want is available through a download link.
Remember that scp has 2 parts after the command name just like the cp command: 1. the location the file currently is, and 2. the location you want to copy the file to. Most of the class has dealt with moving things from TACC to your computer, but in this case things will move the opposite direction, think about what needs to change about your SCP command to accomplish this.
Once the .tar.gz file has been placed in the $WORK/src folder using one of the above options, you need to extract the files.
Now that the files have been extracted you have a choice in how to use them: 1 option is to copy the binary files to a location that is already in your path (such as the $HOME/local/bin directory we set up for you in your .bashrc file), and the second option is to add the $WORK/src/SPAdes-3.13.0-Linux/bin folder to your path. This is a personal preference and I do not know how prevalent either choice is among researchers. I know that my preference is to copy executable to known locations in the path rather than add a ton of different directories to my path, but others may feel differently. Below I present both options:
Doing both of the following may cause unintended effects in the future (particularly if you attempt to update the version of SPAdes you are using) and I do not recommend it.
Data
Tutorial assumes that you are on an idev node. If you are not sure please ask for help.
cds mkdir GVA_SPAdes_tutorial cp $BI/ngs_course/velvet/data/*/* GVA_SPAdes_tutorial cd GVA_SPAdes_tutorial
Now we have a bunch of Illumina reads. These are simulated reads. If you'd ever like to simulate some on your own, you might try using Mason.
paired_end_2x100_ins_1500_c_20.fastq paired_end_2x100_ins_400_c_20.fastq single_end_100_c_50.fastq paired_end_2x100_ins_3000_c_20.fastq paired_end_2x100_ins_400_c_25.fastq paired_end_2x100_ins_3000_c_25.fastq paired_end_2x100_ins_400_c_50.fastq
There are 4 sets of simulated reads:
Set 1 | Set 2 | Set 3 | Set 4 | |
---|---|---|---|---|
Read Size | 100 | 100 | 100 | 100 |
Paired/Single Reads | Single | Paired | Paired | Paired |
Gap Sizes | NA | 400 | 400, 3000 | 400, 3000, 1500 |
Coverage | 50 | 50 | 25 for each subset | 20 for each subset |
Number of Subsets | 1 | 1 | 2 | 3 |
Note that these fastq files are "interleaved", with each read pair together one-after-the-other in the file. The #/1 and #/2 in the read names indicate the pairs.
tacc:~$ head paired_end_2x100_ins_1500_c_20.fastq @READ-1/1 TTTCACCGTTGACCAGCACCCAGTTCAGCGCCGCGCGACCACGATATTTTGGTAACAGCGAACCATGCAGATTAAATGCACCTGCGGGAGCGAGCTGCAA + *@A+<at:var at:name="55G" />T@@I&+@A+@@<at:var at:name="II" />G@+++A++GG++@++I@+@+G&/+I+GD+II@++G@@I?<at:var at:name="I" />@<at:var at:name="IIGGI" /><at:var at:name="A4" />6@A,+AT=<at:var at:name="G" />+@AA+GAG++@ @READ-1/2 TTAACACCGGGCTATAAGTACAATCTGACCGATATTAACGCCGCGATTGCCCTGACACAGTTAGTCAAATTAGAGCACCTCAACACCCGTCGGCGCGAAA + I@@H+A+@G+&+@AG+I>G+I@+CAIA++$+T<at:var at:name="GG" />@+++1+<at:var at:name="GI" />+ICI+A+@<at:var at:name="I" />++A+@@A.@<G@@+)GCGC%I@IIAA++++G+A;@+++@@@@6
Often your read pairs will be "separate" with the corresponding paired reads at the same index in two different files (each with exactly the same number of reads).
Velvet Assembly
Now let's use Velvet to assemble the reads.
First, you will need to load the velvet module.
Using velvet consists of a sequence of two commands:
velveth
- analyzes kmers in the reads in preparation for assemblyvelvetg
- constructs the assembly and filters contigs from the graph
Look at the help for each program.
The <hash_length> parameter of velveth
is the kmer value that is key to the assembly process. Choosing it controls the tradeoff between sensitivity (lower hash_length, more reads included, longer contigs) and specificity (higher hash length, less chance of misassembly, more reads ignored, shorter contigs)). There is more discussion about choosing an appropriate kmer value in the Velvet manual and in this blog post.
Velvet has an option of specifying the insertion size of a paired read file (-ins_length). If no size is given, Velvet will guess the insertion size. We'll just have Velvet guess the size.
Velvet also has an option to specify the expected coverage of the genome, which helps it choose how to resolve repeated sequences (-exp_cov). We set this parameter to estimate this from the data. A common problem with using Velvet is that you have many very short contigs and the last line of output tells you that it used 0 of your reads. This is caused by not setting this parameter. The default is NOT auto.
The following commands will each take some time to run, while they run, read ahead to learn more about what you will be seeing:
velveth single_out 61 -fastq single_end_100_c_50.fastq && velvetg single_out -exp_cov auto -amos_file yes velveth pairedc20_out 61 -fastq -shortPaired paired_end_2x100_ins_3000_c_20.fastq paired_end_2x100_ins_1500_c_20.fastq paired_end_2x100_ins_400_c_20.fastq && velvetg pairedc20_out -exp_cov auto -amos_file yes velveth pairedc25_out 61 -fastq -shortPaired paired_end_2x100_ins_3000_c_25.fastq paired_end_2x100_ins_400_c_25.fastq && velvetg pairedc25_out -exp_cov auto -amos_file yes velveth pairedc50_out 61 -fastq -shortPaired paired_end_2x100_ins_400_c_50.fastq && velvetg pairedc50_out -exp_cov auto -amos_file yes
A warning on memory usage
Velvet and other assemblers usually take large amounts of RAM to complete. Running these 4 commands on a single node at the same time will likely use more RAM than is available on a single node so it's necessary to run them sequentially. This should also underscore to you that you should not run this on the head node. If you are assembling large genomes or have high coverage depth data in the future, you will probably need to submit your jobs to the "largemem" queue rather than the standard que.
What does the && symbol mean?
As we discussed in our piping tutorial, things separated by a ';' mark will execute 1 after the other. In the case of &&, the second command will only execute if the first command finishes correctly (exit status zero to get technical). This can be useful to consider when building a pipeline to limit progression from 1 program to another when something has failed, BUT only if you understand exit states of each program.
Velvet Output
Explore each output directory that was created by Velvet.
Checking the tail of the Log files in each of the output folders, we see lines like the following:
With better read pairs that link more distant locations in the genome, there are fewer contigs, and contigs are are longer, giving us a more complete picture of linkage across the genome.
The complete E. coli genome is about 4.6 Mb. Why weren't we able to assemble it, even with this "perfect" data?
More assembly statistics: contig_stats.pl
The output file stats.txt contains information about every contig in the assembly, but it isn't sorted and can be a bit overwhelming.
You can generate some summary stats and graphs about each assembly using the contig_stats.pl
script in the $BI/bin.
You probably want to change into the directory of results for a specific assembly before running this command, since it generates several output files in the current working directory. You will need to copy the PNG files back to your computer to view it using scp.
Transfer back several of the different outputs into their own direcotry and being comparing them to determine which library and set of parameters seemed to work best.
What do I do now?
Many choices:
- Get a better assembly: maybe add a different library size, or go into a detailed genome completion project (commonly called "finishing") using a sequence assembly editor like
consed
orgap4
orAMOS
. (Be careful though, the amount of data in NGS data sets can be very difficult for these programs to deal with, since many were designed for Sanger sequencing reads.) You may have a lot of PCR products to make to close gaps and/or to order and orient scaffolds.consed
in particular makes this pretty easy, but it may still consume a lot more time and money than the initial shotgun assembly. You can identify some misassemblies by mapping the original reads to the assembly and then viewing them in IGV to look for discordant mate pairs, for example. - Look for things: If you're just after a few homologs, an operon, etc. you're probably done. Most assemblers will be able to take 2x100 data and give you full gene sequences since these are non-repetitive and so assemble well, and obviously 2x600 miSEQ runs will do even better. You can turn the contigs.fa into a blast database (
formatdb
ormakeblastdb
depending on which version of blast you have) and start blasting away.