Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

Overview

SPAdes is a De Bruijn graph assembler which has become the preferred assembler in numerous labs and workflows. In this tutorial we will use SPAdes to assemble an E. coli genome from simulated Illumina reads. Genome assembly is quite difficult (though if Oxford Nanopore lowers its error rate assembly will likely get much easier and involve new tools). Genome assembly should only be used when you can not find a reference genome that is close to your own, if you are engaged in metagenomic projects where you don't know what organisms may be present, and in situations where you believe you may have novel sequence insertions into a genome of interest (Note that in this case however you would actually want to grab reads that do not map to your reference genome (and their pair in the case of paired end and mate-pair sequencing) rather than performing these functions on the fastq files you get from the raw sequencing.

...

Warning
titleIf the end of the spades test gives different output do not continue.


Set 1: Plasmid SPAdes

Set 2: SPAdes example Data

SPAdes provides two example data sets and benchmarks for how long the pipeline might take to run http://spades.bioinf.spbau.ru/release3.11.1/manual.html#sec1.3. In this section we will download the data for the standard E. coli data and compare their results with ours.

Code Block
languagebash
titleDownload data
wget http://spades.bioinf.spbau.ru/spades_test_datasets/ecoli_mc/s_6_1.fastq.gz
wget http://spades.bioinf.spbau.ru/spades_test_datasets/ecoli_mc/s_6_2.fastq.gz

Set 3: Whole Genome Simulated Data

Program self tests are typically safe to run on the head node, but the rest of the Tutorial assumes that you are on an idev node. If you are not sure please ask for help. 

Tip

Unlike other times in the class where we are concerned about being good TACC citizens and not hurting other people by the programs we run, assembly programs are exceptionally memory intensive and attempting to run on the head node will likely result in the program returning a memory error rather than useable results. When it comes time to assemble your own reference genome, remember to give each sample its own compute node rather than having multiple samples split a single node.

Data

Code Block
titleMove to scratch, copy the raw data, and change into this directory for the tutorial
mkdir  $SCRATCH/GVA_SPAdes_tutorial # you likely already did this when you ran the selftest
cp $BI/ngs_course/velvet/data/*/* $SCRATCH/GVA_SPAdes_tutorial
cd $SCRATCH/GVA_SPAdes_tutorial

Now we have a bunch of Illumina reads. These are simulated reads. If you'd ever like to simulate some on your own, you might try using Mason.

Code Block
titleFiles in the tutorial directory
paired_end_2x100_ins_1500_c_20.fastq  paired_end_2x100_ins_400_c_20.fastq  single_end_100_c_50.fastq
paired_end_2x100_ins_3000_c_20.fastq  paired_end_2x100_ins_400_c_25.fastq
paired_end_2x100_ins_3000_c_25.fastq  paired_end_2x100_ins_400_c_50.fastq

There are 4 sets of simulated reads:

...

Set 1

...

Set 2

...

Set 3

...

Set 4

...

Read Size

...

100

...

100

...

100

...

100

...

Paired/Single Reads

...

Single

...

Paired

...

Paired

...

Paired

...

Gap Sizes

...

NA

...

400

...

400, 3000

...

400, 3000, 1500

...

Coverage

...

50

...

50

...

25 for each subset

...

20 for each subset

...

Number of Subsets

...

1

...

1

...

2

...

3

Note that these fastq files are "interleaved", with each read pair together one-after-the-other in the file. The #/1 and #/2 in the read names indicate the pairs.

Code Block
titleInterleaved fastq
tacc:~$ head paired_end_2x100_ins_1500_c_20.fastq
@READ-1/1
TTTCACCGTTGACCAGCACCCAGTTCAGCGCCGCGCGACCACGATATTTTGGTAACAGCGAACCATGCAGATTAAATGCACCTGCGGGAGCGAGCTGCAA
+
*@A+<at:var at:name="55G" />T@@I&+@A+@@<at:var at:name="II" />G@+++A++GG++@++I@+@+G&/+I+GD+II@++G@@I?<at:var at:name="I" />@<at:var at:name="IIGGI" /><at:var at:name="A4" />6@A,+AT=<at:var at:name="G" />+@AA+GAG++@
@READ-1/2
TTAACACCGGGCTATAAGTACAATCTGACCGATATTAACGCCGCGATTGCCCTGACACAGTTAGTCAAATTAGAGCACCTCAACACCCGTCGGCGCGAAA
+
I@@H+A+@G+&+@AG+I>G+I@+CAIA++$+T<at:var at:name="GG" />@+++1+<at:var at:name="GI" />+ICI+A+@<at:var at:name="I" />++A+@@A.@<G@@+)GCGC%I@IIAA++++G+A;@+++@@@@6

Often your read pairs will be "separate" with the corresponding paired reads at the same index in two different files (each with exactly the same number of reads).

SPAdes Assembly

Now let's use SPAdes to assemble the reads. As always its a good idea to get a look at what kind of options the program accepts using the -h option. SPAdes is actually written in python and the base script name is "spades.py". There are additional scripts that change many of the default options such as metaspades.py, plasmidspades.py, and rnaspades.py or these options can be set from the main spades.py script with the flags --meta, --plasmid, --rna respectively.

...

titleUsing the -h option, can you determine what the only required option(s) for the spades program is/are?

The first option in the basic option is:

...

And we will need to supply the read files to the program. In our case we are looking for the following options:

...

It would be more common for us to be using -1 and -2 for each of the paired end reads in normal situations rather than the -12 option, but as mentioned above this data is supplied to you as interleaved which many/most programs will accept, but require you to specify them differently

Once you have figured out what options you need to use see if you can come up with a command to run on the single end and have the output go into a new directory called single_end using all 48 threads that are available (-t 48). The following command is expected to take between 60 and 70 minutes.

spades.py -t 48 -s single_end_100_c_50.fastq -o single_end
Code Block
languagebash
titleDid you come up with the same thing I did?
collapsetrue
Tip

Unlike other times in the class where we are concerned about being good TACC citizens and not hurting other people by the programs we run, assembly programs are exceptionally memory intensive and attempting to run on the head node may result in the program returning a memory error rather than useable results. When it comes time to assemble your own reference genome, remember to give each sample its own compute node rather than having multiple samples split a single node.

Set 2: SPAdes example Data

This assembly may take more than 2 hours to complete and like the next data set it may be better to run this as a job overnight and analyze the data in tomorrow's class. If you choose to attempt to run this data as an interactive session remember that if your idev session runs out of time, the program will stop working and you will have to start over. If you want to submit it as a job, take the commands you want to run (ie the commands that start with "spades.py" and put them in a file named 'commands' using nano. You can include both commands for this data set and the next data set in the same commands file. Finally use the 'launcher_creator.py' command to generate a ".slurm" file and check with instructor that everything is set up correctly so the job will run overnight.

Data

SPAdes provides two example data sets and benchmarks for how long the pipeline might take to run http://spades.bioinf.spbau.ru/release3.11.1/manual.html#sec1.3. In this section we will download the data for the standard E. coli data and compare their results with ours.

Code Block
languagebash
titleDownload data
mkdir  $SCRATCH/GVA_SPAdes_tutorial # you likely already did this when you ran the selftest
cd $SCRATCH/GVA_SPAdes_tutorial
wget http://spades.bioinf.spbau.ru/spades_test_datasets/ecoli_mc/s_6_1.fastq.gz
wget http://spades.bioinf.spbau.ru/spades_test_datasets/ecoli_mc/s_6_2.fastq.gz

SPAdes Assembly

Now let's use SPAdes to assemble the reads. As always its a good idea to get a look at what kind of options the program accepts using the -h option. SPAdes is actually written in python and the base script name is "spades.py". There are additional scripts that change many of the default options such as metaspades.py, plasmidspades.py, and rnaspades.py or these options can be set from the main spades.py script with the flags --meta, --plasmid, --rna respectively.

Expand
titleUsing the -h option, can you determine what the only required option(s) for the spades program is/are?

The first option in the basic option is:

-o<output_dir>directory to store all the resulting files (required)

And we will need to supply the read files to the program. In this case we are looking for the following options:

-1<filename>file with forward paired-end reads

-2<filename>file with reverse paired-end reads

Once you have figured out what options you need to use see if you can come up with a command to run on the single end and have the output go into a new directory called single_end using all 48 threads that are available (-t 48). The following command is expected to take between 60 and 70 minutes.


Code Block
languagebash
titleDid you come up with the same thing I did?
collapsetrue
spades.py -o example -t 48 -1 s_6_1.fastq.gz -2 s_6_2.fastq.gz

Set 3: Whole Genome Simulated Data

Program self tests are typically safe to run on the head node, but the rest of the Tutorial assumes that you are on an idev node. If you are not sure please ask for help. 

Data

Code Block
titleMove to scratch, copy the raw data, and change into this directory for the tutorial
mkdir  $SCRATCH/GVA_SPAdes_tutorial # you likely already did this when you ran the selftest
cp $BI/ngs_course/velvet/data/*/* $SCRATCH/GVA_SPAdes_tutorial
cd $SCRATCH/GVA_SPAdes_tutorial

Now we have a bunch of Illumina reads. These are simulated reads. If you'd ever like to simulate some on your own, you might try using Mason.

Code Block
titleFiles in the tutorial directory
paired_end_2x100_ins_1500_c_20.fastq  paired_end_2x100_ins_400_c_20.fastq  single_end_100_c_50.fastq
paired_end_2x100_ins_3000_c_20.fastq  paired_end_2x100_ins_400_c_25.fastq
paired_end_2x100_ins_3000_c_25.fastq  paired_end_2x100_ins_400_c_50.fastq

There are 4 sets of simulated reads:


Set 1

Set 2

Set 3

Set 4

Read Size

100

100

100

100

Paired/Single Reads

Single

Paired

Paired

Paired

Gap Sizes

NA

400

400, 3000

400, 3000, 1500

Coverage

50

50

25 for each subset

20 for each subset

Number of Subsets

1

1

2

3

Note that these fastq files are "interleaved", with each read pair together one-after-the-other in the file. The #/1 and #/2 in the read names indicate the pairs.

Code Block
titleInterleaved fastq
tacc:~$ head paired_end_2x100_ins_1500_c_20.fastq
@READ-1/1
TTTCACCGTTGACCAGCACCCAGTTCAGCGCCGCGCGACCACGATATTTTGGTAACAGCGAACCATGCAGATTAAATGCACCTGCGGGAGCGAGCTGCAA
+
*@A+<at:var at:name="55G" />T@@I&+@A+@@<at:var at:name="II" />G@+++A++GG++@++I@+@+G&/+I+GD+II@++G@@I?<at:var at:name="I" />@<at:var at:name="IIGGI" /><at:var at:name="A4" />6@A,+AT=<at:var at:name="G" />+@AA+GAG++@
@READ-1/2
TTAACACCGGGCTATAAGTACAATCTGACCGATATTAACGCCGCGATTGCCCTGACACAGTTAGTCAAATTAGAGCACCTCAACACCCGTCGGCGCGAAA
+
I@@H+A+@G+&+@AG+I>G+I@+CAIA++$+T<at:var at:name="GG" />@+++1+<at:var at:name="GI" />+ICI+A+@<at:var at:name="I" />++A+@@A.@<G@@+)GCGC%I@IIAA++++G+A;@+++@@@@6

Often your read pairs will be "separate" with the corresponding paired reads at the same index in two different files (each with exactly the same number of reads).

SPAdes Assembly

Now let's use SPAdes to assemble the reads. As always its a good idea to get a look at what kind of options the program accepts using the -h option. SPAdes is actually written in python and the base script name is "spades.py". There are additional scripts that change many of the default options such as metaspades.py, plasmidspades.py, and rnaspades.py or these options can be set from the main spades.py script with the flags --meta, --plasmid, --rna respectively.

Expand
titleUsing the -h option, can you determine what the only required option(s) for the spades program is/are?

The first option in the basic option is:

-o<output_dir>directory to store all the resulting files (required)

And we will need to supply the read files to the program. In our case we are looking for the following options:

--12<filename>file with interlaced forward and reverse paired-end reads

-s<filename>file with unpaired reads



It would be more common for us to be using -1 and -2 for each of the paired end reads in normal situations rather than the -12 option, but as mentioned above this data is supplied to you as interleaved which many/most programs will accept, but require you to specify them differently

Once you have figured out what options you need to use see if you can come up with a command to run on the single end and have the output go into a new directory called single_end using all 48 threads that are available (-t 48). The following command is expected to take between 60 and 70 minutes.

Code Block
languagebash
titleDid you come up with the same thing I did?
collapsetrue
spades.py -t 48 -o single_end -s single_end_100_c_50.fastq 


If you are planning to run a job overnight, consider adding additional combinations of reads as individual commands to get an idea of how different insert sizes can play a role in final contig lenghts. Just remember that insert length should always increase:

Code Block
languagebash
titlePossible other commands
linenumberstrue
spades.py -t 48 -o 400_1500_3000 --12 paired_end_2x100_ins_400_c_50.fastq --12 paired_end_2x100_ins_1500_c_20.fastq --12 paired_end_2x100_ins_3000_c_25.fastq
spades.py -t 48 -o 400_and_1500 --12 paired_end_2x100_ins_400_c_50.fastq --12 paired_end_2x100_ins_1500_c_20.fastq
spades.py -t 48 -o 400_only --12 paired_end_2x100_ins_400_c_50.fastq 
Warning
titleA warning on memory usage

Velvet SPAdes (and most/all other assemblers) usually take large amounts of RAM to complete. Running these 4 3 commands on a single node at the same time will likely use more RAM than is available on a single node so it's necessary to run them sequentially or on their own node. This should also underscore to you that you should not run this on the head node. If you are assembling large genomes or have high coverage depth data in the future, you will probably need to submit your jobs to the "largemem" queue rather than the standard "normal" que.


Interrogating the output

...

  1. Look for things: If you're just after a few homologs, an operon, etc. you're probably done. Think about what question you are trying to answer. 
  2. You can turn the contigs.fa into a blast database (formatdb or makeblastdb depending on which version of blast you have) or try multiple sequence alignments through NCBIs blast.
  3. If you built your contigs based on a normal/control sample you can map other reads to the contigs using bowtie2 to try to identify variants in other samples.
  4. If you don't think the contigs you have are "good enough"
    1. Try using Spades MismatchCorrector to see if you can improve the contigs you already have.
    2. Add additional sequencing libraries to try to connect some more contigs. Especially think about pacbio sequencing and oxford nanopore.

Original spades publication

Return to GVA2019