Overview

SPAdes is a de Bruijn graph assembler that has become the preferred assembler in numerous labs and workflows. In this tutorial we will use SPAdes to assemble an E. coli genome from simulated Illumina reads. Genome assembly is quite difficult (though if Oxford Nanopore lowers its error rate, assembly will likely get much easier and involve new tools). Genome assembly should only be used when you cannot find a reference genome that is close to your own, when you are engaged in metagenomic projects where you don't know what organisms may be present, or in situations where you believe you may have novel sequence insertions into a genome of interest. Note that in this last case you would actually want to assemble only the reads that do not map to your reference genome (and their mates, in the case of paired-end and mate-pair sequencing) rather than assembling the raw fastq files you get from the sequencer.

A note about read preprocessing

While not explicitly covered here, adapter sequences left on reads can significantly complicate assembly and degrade the result. If using this tutorial on your own samples, make sure you are working with the best data possible: in this case, reads lacking adapters.

For those looking for a real challenge, go through the multiqc tutorial and the trimmomatic tutorial, and use the information provided here to compare assemblies of some of the same samples with and without trimming.

Learning Objectives

Run SPAdes to perform de novo assembly on fragment, paired-end, and mate-paired data.
Use contig_stats.pl to display assembly statistics.
Find proteins of interest in an assembly using Blast.

Installing SPAdes

Genome assembly is an important part of analysis, but it builds a reference file that will then be used many times rather than being rerun for every sample, so it makes more sense to install SPAdes in its own environment. Other tools worth having in the same environment are read preprocessing tools, in particular adapter removal tools such as trimmomatic.

conda create --name GVA-SPAdes -c bioconda spades

Testing SPAdes installation

SPAdes comes with a self-test option to make sure that the program is correctly installed. While this is not true of most programs, it is always a good idea to run whatever test data a program makes available rather than jumping straight into your own data, as knowing whether an error comes from the program rather than from your data makes troubleshooting much easier.

SPAdes self test

conda activate GVA-SPAdes
mkdir $SCRATCH/GVA_SPAdes_tutorial
cd $SCRATCH/GVA_SPAdes_tutorial
spades.py --test
spades.py --version

Assuming everything goes correctly, a large number of lines will scroll past pretty quickly, and the last lines printed to the screen should be:

Correct SPAdes output

======= SPAdes pipeline finished.

========= TEST PASSED CORRECTLY.

SPAdes log can be found here: <$SCRATCH>/GVA_SPAdes_tutorial/spades_test/spades.log

Thank you for using SPAdes!

The lines immediately above this text list the different output files and results from the assembly. They are printed for all SPAdes runs and can be helpful for keeping track of where all your output ends up. The version command should then respond with:

Correct SPAdes output

SPAdes v3.13.0

If the end of the spades test gives different output do not continue.

Get my attention on zoom and we'll figure out what is going on.
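
If you would rather check the result than scroll back through the terminal, one option is to save the screen output to a file and search it for the success line. This is only a minimal sketch: the function name spades_test_passed and the file name spades_test.out are made up for illustration, and it assumes you captured the output yourself with tee.

```bash
# Hypothetical helper: succeeds if the captured self-test output contains the success line.
# Capture the output first with:  spades.py --test 2>&1 | tee spades_test.out
spades_test_passed() {
    grep -q "TEST PASSED CORRECTLY" "$1"
}

# Usage sketch:
# spades_test_passed spades_test.out && echo "OK to continue" || echo "Stop and ask for help"
```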

Set 1: Plasmid SPAdes

Elsewhere in the class we avoid the head node mainly to be good TACC citizens and not hurt other people with the programs we run. Here there is an additional reason: assembly programs are exceptionally memory intensive, and attempting to run them on the head node may result in the program returning a memory error rather than usable results. When it comes time to assemble your own reference genome, remember to give each sample its own compute node rather than having multiple samples split a single node. If you still run into memory problems, consider moving from the 'normal' queue to the 'large-mem' queue, which has more memory, and also consider downsampling your data.

Assembling even small bacterial genomes can be incredibly time intensive (as well as memory intensive, as highlighted above). Fortunately for this class, we can make use of the plasmid spades option to assemble an even smaller plasmid genome that is ~2000 bp long in only a few minutes. I suggest analyzing this data on an idev node and then submitting the other data analysis for the bacterial genomes as a job to run overnight.

Data

Download the paired end fastq files which have had their adapters trimmed from the $BI/gva_course/Assembly/ directory.

You should know the copy command by now. Try to get it on your own before checking your answer here.

cp $BI/gva_course/Assembly/*.fastq.gz $SCRATCH/GVA_SPAdes_tutorial
cd $SCRATCH/GVA_SPAdes_tutorial

SPAdes Assembly

Now let's use SPAdes to assemble the reads. As always, it's a good idea to get a look at what kind of options the program accepts using the -h option. SPAdes is actually written in python and the base script name is "spades.py". There are additional scripts that change many of the default options, such as metaspades.py, plasmidspades.py, and rnaspades.py, or these options can be set from the main spades.py script with the flags --meta, --plasmid, and --rna respectively. For this tutorial let's use plasmidspades.py.

Using the -h option, can you determine what the only required option(s) for the spades program is/are?

The first option listed under the basic options is:

-o <output_dir> directory to store all the resulting files (required)

And we will need to supply the read files to the program. In this case we are looking for the following options:

-1 <filename> file with forward paired-end reads

-2 <filename> file with reverse paired-end reads

Once you have figured out what options you need to use, see if you can come up with a command to run on the paired-end reads and have the output go into a new directory called plasmid using all 68 cores that are available on your idev node (-t 68). The following command is expected to take less than 2 minutes.

Remember to make sure you are on an idev node

For reasons discussed numerous times throughout the course already, please be sure you are on an idev node. Remember the hostname command and showq -u can be used to check if you are on one of the login nodes or one of the compute nodes. If you need more information or help re-launching a new idev node, please see this tutorial.
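
As a rough illustration of the hostname check, the sketch below classifies a hostname as a login or compute node. The function name node_kind is hypothetical, and the rule it uses (login nodes have hostnames beginning with "login") is an assumption that you should confirm against what hostname actually prints on your system.

```bash
# Hypothetical helper: guess whether a hostname belongs to a login node.
# Assumption: login nodes have hostnames starting with "login"; anything else is
# treated as a compute node. With no argument, the current hostname is used.
node_kind() {
    case "${1:-$(hostname)}" in
        login*) echo "login" ;;
        *)      echo "compute" ;;
    esac
}

# Usage sketch: warn before starting an assembly on a login node.
# [ "$(node_kind)" = "compute" ] || echo "You are on a login node: start an idev session first"
```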

Did you come up with the same thing I did?

plasmidspades.py -t 68 -o plasmid -1 SH1_1P.fastq.gz -2 SH1_2P.fastq.gz

Evaluating the output

As you can see from listing the contents of the output 'plasmid' directory, several new files have been generated. There are two files that I consider to be the most important:
1. contigs.fasta, as this is the actual result: all of the different contigs that were created. For circular chromosomes (such as plasmids) the goal is a single contig, meaning that the reads were able to close the circle.
2. spades.log, as it has the information about the completed run that you can use to compare different samples or conditions in the event that you are interested in trying to optimize the command options, as would likely be the case if you were trying to assemble the best reference possible.

Looking at the contigs.fasta file can you answer the following questions (it is small enough to interrogate with cat or any other program)?

How many contigs were generated?
Just 1 (it's a fasta file, so you focus on the > symbols to identify each different contig that is present).
How long is the contig?
2046 bp
How deep is the coverage of this plasmid?
In this case the answer is ~180. This value can be particularly useful when you are trying to determine if novel DNA is present as a multi-copy plasmid, or as something that has inserted into the chromosome. If it is inserted, you would expect the coverage to be similar to that of the chromosome; if it is a plasmid, it could be significantly higher.
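
To pull the contig number, length, and coverage out of the header lines rather than reading them by eye, something like the following sketch may help. The function names contig_table and coverage_ratio are made up for illustration, and the code assumes headers of the form >NODE_<n>_length_<bp>_cov_<depth>.

```bash
# Hypothetical helper: print "contig_number <tab> length <tab> coverage" for each contig.
# Assumes headers look like >NODE_<n>_length_<bp>_cov_<depth>.
contig_table() {
    awk -F'_' '/^>/ { print $2 "\t" $4 "\t" $6 }' "$1"
}

# Hypothetical helper: ratio of one coverage value to another,
# e.g. a candidate plasmid contig versus a chromosomal contig.
coverage_ratio() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.1f\n", a / b }'
}

# Usage sketch:
# contig_table plasmid/contigs.fasta
```

In your own data, a ratio near 1 between a contig of interest and the chromosomal contigs would be consistent with an insertion, while a much larger ratio would point toward a multi-copy plasmid.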

Set 2: Whole Genome Simulated Data

Here we will look at 4 sets of data with different library preparation conditions to evaluate how wet lab decisions influence outcomes on the computer. Some of the text here is very similar or identical to that in set 1 in case people choose to skip directly to it.

Data

Move to scratch, copy the raw data, and change into this directory for the tutorial

mkdir -p $SCRATCH/GVA_SPAdes_tutorial # you likely already did this when you ran the self test; -p avoids an error if the directory exists
cp $BI/ngs_course/velvet/data/*/* $SCRATCH/GVA_SPAdes_tutorial
cd $SCRATCH/GVA_SPAdes_tutorial

Now we have a bunch of Illumina reads. These are simulated reads. If you'd ever like to simulate some on your own, you might try using Mason.

Files in the tutorial directory

paired_end_2x100_ins_1500_c_20.fastq  paired_end_2x100_ins_400_c_20.fastq  single_end_100_c_50.fastq
paired_end_2x100_ins_3000_c_20.fastq  paired_end_2x100_ins_400_c_25.fastq
paired_end_2x100_ins_3000_c_25.fastq  paired_end_2x100_ins_400_c_50.fastq

There are 4 sets of simulated reads:

	Set 1	Set 2	Set 3	Set 4
Read Size	100	100	100	100
Paired/Single Reads	Single	Paired	Paired	Paired
Gap Sizes	NA	400	400, 3000	400, 3000, 1500
Coverage	50	50	25 for each subset	20 for each subset
Number of Subsets	1	1	2	3
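
The coverage values in the table follow from the usual relationship: coverage = number of reads x read length / genome size. As a hedged sketch, the hypothetical function estimate_coverage below applies that formula to a FASTQ file; it assumes an uncompressed file with exactly 4 lines per read, and you supply the read length and genome size yourself.

```bash
# Hypothetical helper: expected coverage = number of reads * read length / genome size.
# Assumes an uncompressed FASTQ with exactly 4 lines per read.
# Arguments: fastq file, read length in bp, genome size in bp.
estimate_coverage() {
    awk -v len="$2" -v genome="$3" 'END { printf "%.1f\n", (NR / 4) * len / genome }' "$1"
}

# Usage sketch with the single-end file and the ~4.6 Mb E. coli genome:
# estimate_coverage single_end_100_c_50.fastq 100 4600000
```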

Note that these fastq files are "interleaved", with each read pair together one-after-the-other in the file. The #/1 and #/2 in the read names indicate the pairs. This is not something you will encounter very often if at all.

Look at the first 2 reads of the Interleaved fastq file paired_end_2x100_ins_1500_c_20.fastq

head -n 8 paired_end_2x100_ins_1500_c_20.fastq

And your expected output is:

@READ-1/1
TTTCACCGTTGACCAGCACCCAGTTCAGCGCCGCGCGACCACGATATTTTGGTAACAGCGAACCATGCAGATTAAATGCACCTGCGGGAGCGAGCTGCAA
+
*@A+@55G@T@@I&+@A+@@@II@G@+++A++GG++@++I@+@+G&/+I+GD+II@++G@@I?@I@@@IIGGI@@A4@6@A,+AT=@G@+@AA+GAG++@
@READ-1/2
TTAACACCGGGCTATAAGTACAATCTGACCGATATTAACGCCGCGATTGCCCTGACACAGTTAGTCAAATTAGAGCACCTCAACACCCGTCGGCGCGAAA
+
I@@H+A+@G+&+@AG+I>G+I@+CAIA++$+T@GG@@+++1+@GI@+ICI+A+@@I@++A+@@A.@<G@@+)GCGC%I@IIAA++++G+A;@+++@@@@6

Notice how the pairs of reads are denoted by the /1 and /2 at the end of the first line in the 4 line fastq block. More often (and everywhere else in this course) your read pairs will be "separate" with the corresponding paired reads at the same index in two different files (each with exactly the same number of reads).
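
If you ever need to convert an interleaved file into the "separate" layout for a program that does not accept interleaved input, one possible approach is sketched here. The function name deinterleave is hypothetical, and the code assumes exactly 4 lines per read with the /1 read always directly followed by its /2 mate.

```bash
# Hypothetical helper: split an interleaved FASTQ into forward and reverse files.
# Assumes 4 lines per read and strict /1 then /2 ordering within each pair.
# Arguments: interleaved fastq, output file for /1 reads, output file for /2 reads.
deinterleave() {
    awk -v r1="$2" -v r2="$3" '{
        # Lines 1-4 of each 8-line block are the forward read, lines 5-8 the reverse read.
        if ((NR - 1) % 8 < 4) { print > r1 } else { print > r2 }
    }' "$1"
}

# Usage sketch:
# deinterleave paired_end_2x100_ins_400_c_50.fastq ins_400_R1.fastq ins_400_R2.fastq
```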

SPAdes Assembly

Now let's use SPAdes to assemble the reads. As always, it's a good idea to get a look at what kind of options the program accepts using the -h option. SPAdes is actually written in python and the base script name is "spades.py". There are additional scripts that change many of the default options, such as metaspades.py, plasmidspades.py, and rnaspades.py, or these options can be set from the main spades.py script with the flags --meta, --plasmid, and --rna respectively.

Using the -h option, can you determine what the only required option(s) for the spades program is/are?

The first option listed under the basic options is:

-o <output_dir> directory to store all the resulting files (required)

And we will need to supply the read files to the program. In our case we are looking for the following options:

--12 <filename> file with interlaced forward and reverse paired-end reads

-s <filename> file with unpaired reads

It would be more common for us to be using -1 and -2 for each of the paired end reads in normal situations rather than the --12 option, but as mentioned above this data is supplied to you as interleaved, which many/most programs will accept but require you to specify differently.

Once you have figured out what options you need to use, see if you can come up with a command to run on the single-end reads and have the output go into a new directory called single_end using all 68 threads that are available (-t 68).

Did you come up with the same thing I did?

spades.py -t 68 -o single_end -s single_end_100_c_50.fastq

Consider adding a few more commands to show the effect of increasing fragment size, and be sure to give them their own output name:

Possible other commands

spades.py -t 68 -o 400_1500_3000 --12 paired_end_2x100_ins_400_c_50.fastq --12 paired_end_2x100_ins_1500_c_20.fastq --12 paired_end_2x100_ins_3000_c_25.fastq
spades.py -t 68 -o 400_and_1500 --12 paired_end_2x100_ins_400_c_50.fastq --12 paired_end_2x100_ins_1500_c_20.fastq
spades.py -t 68 -o 400_only --12 paired_end_2x100_ins_400_c_50.fastq

Put all 4 of the commands into a file named spades_commands. Be sure to ask for help if you are unsure how to use nano to do this.
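
If you prefer not to use nano, a heredoc is one alternative way to create the file. The sketch below simply writes the single-end command and the three paired-end commands shown above into spades_commands in the current directory and then counts the lines; adjust the commands if you chose different combinations.

```bash
# Write the four assembly commands to spades_commands, one command per line.
cat > spades_commands <<'EOF'
spades.py -t 68 -o single_end -s single_end_100_c_50.fastq
spades.py -t 68 -o 400_only --12 paired_end_2x100_ins_400_c_50.fastq
spades.py -t 68 -o 400_and_1500 --12 paired_end_2x100_ins_400_c_50.fastq --12 paired_end_2x100_ins_1500_c_20.fastq
spades.py -t 68 -o 400_1500_3000 --12 paired_end_2x100_ins_400_c_50.fastq --12 paired_end_2x100_ins_1500_c_20.fastq --12 paired_end_2x100_ins_3000_c_25.fastq
EOF

# Confirm that the file holds 4 lines, one per command.
wc -l spades_commands
```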

A warning on memory usage

SPAdes (and most/all other assemblers) usually takes large amounts of RAM to complete. Running these 4 commands on a single node at the same time will likely use more RAM than is available on a single node, so it's necessary to run them sequentially or each on its own node. This should also underscore to you that you should not run this on the head node. If you are assembling large genomes or have high coverage depth data in the future, you will probably need to submit your jobs to the "largemem" queue rather than the "normal" queue, and may need to downsample your data.
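
As one very simple illustration of downsampling, the sketch below keeps every k-th read pair from an interleaved FASTQ file. The function name keep_every_kth_pair is hypothetical, the code assumes exactly 8 lines per read pair, and it is systematic rather than random subsampling, so dedicated tools may be a better choice for real data.

```bash
# Hypothetical helper: keep every k-th read pair from an interleaved FASTQ.
# Assumes 8 lines per pair (4 for the /1 read, 4 for the /2 read).
# Arguments: interleaved fastq, k (keep 1 pair out of every k). Output goes to stdout.
keep_every_kth_pair() {
    awk -v k="$2" 'int((NR - 1) / 8) % k == 0' "$1"
}

# Usage sketch: roughly halve the coverage of one of the tutorial files.
# keep_every_kth_pair paired_end_2x100_ins_400_c_50.fastq 2 > paired_400_half.fastq
```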

Submitting the job

Once you have decided on the combinations you want to evaluate, use the 'wc -l' command to verify that your spades_commands file has 4 commands as you expect.

As we have seen in other tutorials involving the job queue system, we need a slurm file and need to modify it according to what we are actually trying to run.

Modify your slurm file to control the queue system's computer

cp /corral-repl/utexas/BioITeam/gva_course/GVA2021.launcher.slurm spades.slurm
nano spades.slurm

Again, while in nano you will edit most of the same lines you edited in the breseq tutorial. Note that most of these lines have additional text to the right of the line. This commented text is present to help remind you what goes on each line. Leaving it alone will not hurt anything; removing it may make it more difficult for you to remember what the purpose of the line is.

Line number	As is	To be
16	#SBATCH -J jobName	#SBATCH -J spades
17	#SBATCH -n 1	#SBATCH -n 4
18	#SBATCH -N 1	#SBATCH -N 4
21	#SBATCH -t 12:00:00	#SBATCH -t 4:30:00
22	##SBATCH --mail-user=ADD	#SBATCH --mail-user=<YourEmailAddress>
23	##SBATCH --mail-type=all	#SBATCH --mail-type=all
27	conda activate GVA2021	conda activate GVA-SPAdes
31	export LAUNCHER_JOB_FILE=commands	export LAUNCHER_JOB_FILE=spades_commands

The changes to lines 22 and 23 are optional but will give you an idea of what types of email you could expect from TACC if you choose to use these options. Just be sure to pay attention to these 2 lines starting with a single # symbol after editing them.

Line 27 assumes you installed spades in its own environment named GVA-SPAdes at the beginning of this tutorial.

Again, use ctrl-o and ctrl-x to save the file and exit.
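
For those comfortable with sed, the required edits from the table above could also be applied without opening nano. This is a sketch only: the function name edit_spades_slurm is hypothetical, it assumes the "As is" text matches your copy of the slurm file exactly, and it leaves the optional mail lines untouched. A backup of the original is kept with a .bak extension, and it is worth viewing the file afterwards to confirm the changes.

```bash
# Hypothetical helper: apply the required edits from the table with sed.
# Assumes the "As is" text appears in the file exactly as listed in the table.
# The original file is preserved as <file>.bak.
edit_spades_slurm() {
    sed -i.bak \
        -e 's/^#SBATCH -J jobName/#SBATCH -J spades/' \
        -e 's/^#SBATCH -n 1/#SBATCH -n 4/' \
        -e 's/^#SBATCH -N 1/#SBATCH -N 4/' \
        -e 's/^#SBATCH -t 12:00:00/#SBATCH -t 4:30:00/' \
        -e 's/conda activate GVA2021/conda activate GVA-SPAdes/' \
        -e 's/LAUNCHER_JOB_FILE=commands/LAUNCHER_JOB_FILE=spades_commands/' \
        "$1"
}

# Usage sketch:
# edit_spades_slurm spades.slurm
```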

Submit the job to run on the queue

sbatch spades.slurm

Evaluating the output

Explore each output directory that was created for each set of reads you interrogated. The actionable information is in the contigs.fasta file. The contig file is a fasta file ordered by contig length in decreasing order. The name of each contig lists the number of the contig (the largest contig being named NODE_1, the next largest NODE_2, and so on) followed by the length of the contig and the coverage (labeled as cov on the line). Generally, assemblies with fewer total contigs and longer contigs are regarded as better, but the number of chromosomes present in the organism is an important factor as well.

The grep command can be quite useful for isolating the names of the contigs with the information, especially when combined with the -c option to count the total number of contigs, or piping the results to head/tail or both head and tail to isolate the top/bottom contigs.

Example grep commands

# Count the total number of contigs:
grep -c "^>" single_end/contigs.fasta

# Determine the length of the 5 largest contigs:
grep "^>" single_end/contigs.fasta | head -n 5

# Determine the length of the 20 smallest contigs:
grep "^>" single_end/contigs.fasta | tail -n 20

# Determine the length of the 100th through 110th contigs (11 lines):
grep "^>" single_end/contigs.fasta | head -n 110 | tail -n 11
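
Beyond grep, a few summary statistics can make it easier to compare the different assemblies. The sketch below reports the number of contigs, the total length, the largest contig, and the N50 (the length of the contig at which the running total, counting from the largest contig down, first reaches half of the total assembly length). The function name contig_summary is hypothetical, and the code assumes every header contains _length_<bp>_cov_ as described above.

```bash
# Hypothetical helper: summarize an assembly from its contig headers.
# Assumes every header contains "_length_<bp>_cov_".
contig_summary() {
    grep '^>' "$1" \
        | sed -E 's/.*_length_([0-9]+)_cov_.*/\1/' \
        | sort -rn \
        | awk '{ len[NR] = $1; total += $1 }
               END {
                   half = total / 2
                   # Walk from the largest contig down until half the assembly is covered.
                   for (i = 1; i <= NR; i++) {
                       run += len[i]
                       if (run >= half) { n50 = len[i]; break }
                   }
                   printf "contigs=%d total_bp=%d largest=%d N50=%d\n", NR, total, len[1], n50
               }'
}

# Usage sketch: compare two of the assemblies.
# contig_summary single_end/contigs.fasta
# contig_summary 400_only/contigs.fasta
```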

Since you ran multiple different combinations of reads for the simulated data, how did the insert size affect the number of contigs? The length of the largest contigs? Why might larger insert sizes not help things very much?

Answer

The length of repetitive elements in the genome plays a large role in how easily it can be assembled as large repeats need even larger insert sizes to be spanned by single read pairs.

The complete E. coli genome is about 4.6 Mb. Why weren't we able to assemble it, even with the "perfect" simulated data?

Answer

There are 7 nearly identical ribosomal RNA operons in E. coli spaced throughout the chromosome. Since each is >3000 bases, contigs cannot be connected across them using this data.

What comes next when working with your own data?

Look for things: If you're just after a few homologs, an operon, etc. you're probably done. Think about what question you are trying to answer.
You can turn the contigs.fasta file into a blast database (formatdb or makeblastdb depending on which version of blast you have) or try multiple sequence alignments through NCBI's blast.
If you built your contigs based on a normal/control sample you can map other reads to the contigs using bowtie2 to try to identify variants in other samples.
If you don't think the contigs you have are "good enough"
1. Verify you have trimmed your reads to the best they can be using fastqc, multiqc, and trimmomatic.
2. Try using SPAdes MismatchCorrector to see if you can improve the contigs you already have.
3. Add additional sequencing libraries to try to connect some more contigs. Especially think about PacBio and Oxford Nanopore sequencing.
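
The blast database step mentioned above could look something like the following sketch, assuming the BLAST+ programs makeblastdb and tblastn are installed and on your PATH (they are not part of the GVA-SPAdes environment created in this tutorial). The function name find_proteins, the default database name contigs_db, and the e-value cutoff are all made up for illustration.

```bash
# Hypothetical helper: search an assembly for protein sequences of interest.
# Assumes BLAST+ (makeblastdb, tblastn) is installed; returns 127 if it is not found.
# Arguments: contigs fasta, protein fasta, optional database name (default contigs_db).
find_proteins() {
    local contigs="$1" proteins="$2" db="${3:-contigs_db}"
    command -v makeblastdb >/dev/null 2>&1 || { echo "BLAST+ not found on PATH" >&2; return 127; }
    # Build a nucleotide database from the contigs.
    makeblastdb -in "$contigs" -dbtype nucl -out "$db" >/dev/null
    # Search translated contigs with the protein queries; tabular output (format 6).
    tblastn -query "$proteins" -db "$db" -outfmt 6 -evalue 1e-10
}

# Usage sketch (my_proteins.faa is a placeholder for your own protein fasta file):
# find_proteins single_end/contigs.fasta my_proteins.faa
```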

Original spades publication

Return to GVA2021 page.

Genome Assembly (SPAdes) -- GVA2021
