Genome Assembly (SPAdes) -- GVA2022
Overview
SPAdes is a De Bruijn graph assembler which has become the preferred assembler in numerous labs and workflows. In this tutorial we will use SPAdes to assemble an E. coli genome from simulated Illumina reads. Genome assembly is quite difficult (though as Oxford Nanopore lowers its error rate, and as tools that combine its long reads with Illumina short reads improve, the difficulty falls while the accuracy increases). Genome assembly should only be used:
- When you cannot find a reference genome that is close to your own.
- If you are engaged in metagenomic projects where you don't know what organisms may be present.
- There are other tools that can be useful for this type of work.
- In situations where you believe you may have novel sequence insertions into a genome of interest.
- Note that in this case, however, you might actually want to assemble only the reads that do not map to your reference genome (and their mates in the case of paired-end and mate-pair sequencing) rather than assembling the raw fastq files you get from the sequencer.
A note about read preprocessing
While not explicitly covered here, the presence of adapter sequences on reads can significantly complicate assembly and decrease its accuracy. If using this tutorial on your own samples, make sure you are working with the best data possible: in this case, reads lacking adapters, with the largest insert sizes possible.
For those looking for a real challenge, go through the multiqc tutorial and the fastp tutorial, and use the information provided here to compare assemblies of some of the same samples in both cases.
Learning Objectives
- Run SPAdes to perform de novo assembly on fragment, paired-end, and mate-paired data.
- Use contig_stats.pl to display assembly statistics.
- Find proteins of interest in an assembly using Blast.
Installing SPAdes
Genome assembly is an important part of analysis, but because it builds a reference file that will be used many times rather than being rerun routinely, it makes more sense to install SPAdes in its own environment. Other potential tools to have in the same environment would be read preprocessing tools, in particular adapter removal tools such as fastp. This supports the suggestion made in the fastp tutorial that, if environments are to be grouped together based on task, read preprocessing is a good task to build an environment around.
conda create --name GVA-Assembly -c bioconda spades -c conda-forge
Testing SPAdes installation
SPAdes comes with a self test option to make sure that the program is correctly installed. While this is not true of most programs, it is always a good idea to run whatever test data a program makes available rather than jumping straight into your own data as knowing there is an error in the program rather than your data makes troubleshooting very different.
conda activate GVA-Assembly
mkdir $SCRATCH/GVA_SPAdes_tutorial
cd $SCRATCH/GVA_SPAdes_tutorial
spades.py --test
spades.py --version
Assuming everything goes correctly, a large number of lines will scroll past pretty quickly, and the last lines printed to the screen should be:
======= SPAdes pipeline finished WITH WARNINGS!
=== Error correction and assembling warnings:
 * 0:00:01.927 1M / 47M WARN General (launcher.cpp : 178) Your data seems to have high uniform coverage depth. It is strongly recommended to use --isolate option.
======= Warnings saved to $SCRATCH/GVA_SPAdes_tutorial/spades_test/warnings.log
SPAdes log can be found here: $SCRATCH/GVA_SPAdes_tutorial/spades_test/spades.log
Thank you for using SPAdes!
The lines printed immediately above this block list the different output files and results from the assembly. They appear in all SPAdes runs and can be helpful for keeping track of where all your output ends up. The version command should then respond with:
SPAdes v3.15.4
Since we didn't set any options, and only ran the prepackaged tests, ignoring the warning seems highly reasonable. If we got a similar warning with our own samples, rerunning the analysis with the recommended --isolate option and comparing the 2 results would be a good use of our time.
If the end of the spades test gives different output do not continue.
Get my attention on zoom and we'll figure out what is going on.
Set 1: Plasmid SPAdes
Unlike other times in the class, where the concern is being good TACC citizens and not hurting other people with the programs we run, here the concern is more direct: assembly programs are exceptionally memory intensive, and attempting to run one on the head node may result in the program returning a memory error rather than usable results. When it comes time to assemble your own reference genome, remember to give each sample its own compute node rather than having multiple samples split a single node. If you still run into memory problems, consider moving from the 'normal' queue to the 'large-mem' queue, which has more memory, and also consider downsampling your data.
Assembling even small bacterial genomes can be incredibly time intensive (as well as memory intensive, as highlighted above). Fortunately for this class, we can make use of the plasmid SPAdes option to assemble an even smaller plasmid genome that is ~2000 bp long in only a few minutes. I suggest analyzing this data on an idev node and then submitting the bacterial genome analysis as a job to run overnight.
Data
Download the paired end fastq files which have had their adapters trimmed from the $BI/gva_course/Assembly/ directory.
SPAdes Assembly
Now let's use SPAdes to assemble the reads. As always, it's a good idea to get a look at what kind of options the program accepts using the -h option. SPAdes is actually written in python and the base script name is "spades.py". There are additional scripts that change many of the default options, such as metaspades.py, plasmidspades.py, and rnaspades.py, or these options can be set from the main spades.py script with the flags --meta, --plasmid, and --rna respectively. For this tutorial let's use plasmidspades.py.
Once you have figured out what options you need to use see if you can come up with a command to run on the paired end reads and have the output go into a new directory called plasmid using all 68 cores that are available on your idev node (-t 68). The following command is expected to take less than 2 minutes.
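If you get stuck, one possible command is sketched below. It is only a sketch: the two read file names are hypothetical placeholders (use the names of the files you actually copied from $BI/gva_course/Assembly/), and the small run helper is something I made up for this example so the command can be printed and checked before it is launched.

```shell
# Hypothetical file names: substitute the two fastq files you copied
# from $BI/gva_course/Assembly/
r1=plasmid_R1.fastq.gz
r2=plasmid_R2.fastq.gz

# Helper for this example only: print the command for review when
# DRY_RUN is set, otherwise execute it.
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        echo "$*"
    else
        "$@"
    fi
}

DRY_RUN=1   # change to DRY_RUN= (empty) to actually launch the assembly

# -t: threads, -1/-2: forward and reverse read files, -o: output directory
run plasmidspades.py -t 68 -1 "$r1" -2 "$r2" -o plasmid
```

With DRY_RUN set, the block only prints the plasmidspades.py command line. Once the file names are correct, empty the DRY_RUN variable and run it again on your idev node.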
Remember to make sure you are on an idev node
For reasons discussed numerous times throughout the course already, please be sure you are on an idev node. Remember the hostname command and showq -u can be used to check if you are on one of the login nodes or one of the compute nodes. If you need more information or help re-launching a new idev node, please see this tutorial.
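A small guard like the following can catch the mistake before an assembly is launched. It is a minimal sketch that assumes login node hostnames start with "login"; check what hostname prints on your own system before relying on it. The function name on_login_node is made up for this example.

```shell
# Return success (0) when the given hostname looks like a login node.
# Assumption: login node hostnames begin with "login".
on_login_node() {
    case "$1" in
        login*) return 0 ;;
        *)      return 1 ;;
    esac
}

if on_login_node "$(hostname)"; then
    echo "This looks like a login node: start an idev session before assembling."
else
    echo "This does not look like a login node."
fi
```

The check only looks at the name, so showq -u is still the more reliable way to confirm that your idev job is actually running.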
Evaluating the output
As you can see from listing the contents of the output 'plasmid' directory, several new files have been generated. There are two files that I consider to be the most important:
1. contigs.fasta, as this is the actual result: all the different contigs that were created. For circular chromosomes (such as plasmids) the goal would be a single contig, meaning that the reads were able to close the circle.
2. spades.log, as it has the information about the completed run that you can use to compare different samples or conditions in the event that you are interested in trying to optimize the command options, as would likely be the case if you were trying to assemble the best reference possible. Interestingly, the spades.log file is equivalent to what you would get if you had redirected the error and screen printing to a log file yourself (i.e. using &> as was done in the fastp tutorial).
Looking at the contigs.fasta file can you answer the following questions? (it is small enough to interrogate with cat or any other program)
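For files too large to read by eye, the same questions can be asked with grep and awk. The sketch below builds a tiny made-up fasta file (the contig names and sequences are invented, not real SPAdes output) so that it runs anywhere; point the two commands at plasmid/contigs.fasta to use it on your own result.

```shell
# Tiny stand-in for plasmid/contigs.fasta; names and sequences are invented.
workdir=$(mktemp -d)
cat > "$workdir/contigs.fasta" <<'EOF'
>NODE_1_length_12_cov_30.5
ACGTAC
GTACGT
>NODE_2_length_4_cov_2.0
ACGT
EOF

# Number of contigs = number of header lines
n_contigs=$(grep -c "^>" "$workdir/contigs.fasta")

# Print each contig name followed by the number of bases actually present,
# summing over sequence lines because fasta sequences may be wrapped
lengths=$(awk '/^>/ { if (name) print name, len; name = substr($0, 2); len = 0; next }
               { len += length($0) }
               END { if (name) print name, len }' "$workdir/contigs.fasta")

echo "contigs: $n_contigs"
echo "$lengths"
```

The counted length should agree with the length given in each contig name, which is a quick sanity check that the file was not truncated.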
Visualizing the assembly
Another file that may be of interest (especially if you are going to try to manually make improvements to the assembly or take a targeted approach to improving it) is assembly_graph.fastg. I would recommend opening this file with the Bandage program (https://rrwick.github.io/Bandage/). It is lightweight and easily installed on all systems, and while it is pretty intuitive it also has robust documentation (https://github.com/rrwick/Bandage/wiki). Viewing this plasmid in Bandage will effectively just show you a circle, as it is completely closed. The good news is that Bandage is powerful enough to support larger genomes, which may be of help or interest in the simulated data set.
Set 2: Whole Genome Simulated Data
Here we will look at 4 sets of data with different library preparation conditions to evaluate how wet lab decisions influence outcomes on the computer. Some of the text here is very similar or identical to that in set 1 in case people choose to skip directly to it.
Data
mkdir $SCRATCH/GVA_SPAdes_tutorial  # you likely already did this when you ran the selftest
cp $BI/ngs_course/velvet/data/*/* $SCRATCH/GVA_SPAdes_tutorial
cd $SCRATCH/GVA_SPAdes_tutorial
Now we have a bunch of Illumina reads. These are simulated reads. If you'd ever like to simulate some on your own, you might try using Mason.
paired_end_2x100_ins_1500_c_20.fastq
paired_end_2x100_ins_3000_c_20.fastq
paired_end_2x100_ins_3000_c_25.fastq
paired_end_2x100_ins_400_c_20.fastq
paired_end_2x100_ins_400_c_25.fastq
paired_end_2x100_ins_400_c_50.fastq
single_end_100_c_50.fastq
There are 4 sets of simulated reads:
Property | Set 1 | Set 2 | Set 3 | Set 4 |
---|---|---|---|---|
Read Size (bp) | 100 | 100 | 100 | 100 |
Paired/Single Reads | Single | Paired | Paired | Paired |
Insert Sizes (bp) | NA | 400 | 400, 3000 | 400, 3000, 1500 |
Coverage (x) | 50 | 50 | 25 for each subset | 20 for each subset |
Number of Subsets | 1 | 1 | 2 | 3 |
Note that these fastq files are "interleaved", with each read pair together one-after-the-other in the file. The #/1 and #/2 in the read names indicate the pairs. This is not something you will encounter very often, if at all.
If you use head to look at the first 8 lines of one of the paired-end files, you should see a read pair like:
@READ-1/1
TTTCACCGTTGACCAGCACCCAGTTCAGCGCCGCGCGACCACGATATTTTGGTAACAGCGAACCATGCAGATTAAATGCACCTGCGGGAGCGAGCTGCAA
+
*@A+@55G@T@@I&+@A+@@@II@G@+++A++GG++@++I@+@+G&/+I+GD+II@++G@@I?@I@@@IIGGI@@A4@6@A,+AT=@G@+@AA+GAG++@
@READ-1/2
TTAACACCGGGCTATAAGTACAATCTGACCGATATTAACGCCGCGATTGCCCTGACACAGTTAGTCAAATTAGAGCACCTCAACACCCGTCGGCGCGAAA
+
I@@H+A+@G+&+@AG+I>G+I@+CAIA++$+T@GG@@+++1+@GI@+ICI+A+@@I@++A+@@A.@<G@@+)GCGC%I@IIAA++++G+A;@+++@@@@6
Notice how the pairs of reads are denoted by the /1 and /2 at the end of the first line in the 4 line fastq block. More often (and everywhere else in this course) your read pairs will be "separate" with the corresponding paired reads at the same index in two different files (each with exactly the same number of reads).
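SPAdes reads interleaved files directly through its --12 option, so no conversion is needed for this tutorial. If another tool only accepts "separate" files, an interleaved file can be split with awk. The sketch below assumes every read occupies exactly 4 lines and that mates alternate strictly (/1 then /2); the reads and file names are invented for the example.

```shell
workdir=$(mktemp -d)
# Two invented read pairs in interleaved order (/1 then /2), 4 lines per read
cat > "$workdir/interleaved.fastq" <<'EOF'
@READ-1/1
ACGT
+
IIII
@READ-1/2
TTGA
+
IIII
@READ-2/1
GGCC
+
IIII
@READ-2/2
CCAA
+
IIII
EOF

# Lines 1-4 of every block of 8 go to the /1 file, lines 5-8 to the /2 file
awk -v r1="$workdir/reads_1.fastq" -v r2="$workdir/reads_2.fastq" \
    '{ if ((NR - 1) % 8 < 4) print > r1; else print > r2 }' \
    "$workdir/interleaved.fastq"

# Both files should now hold the same number of reads (lines / 4)
reads_1=$(( $(wc -l < "$workdir/reads_1.fastq") / 4 ))
reads_2=$(( $(wc -l < "$workdir/reads_2.fastq") / 4 ))
echo "reads in file 1: $reads_1, reads in file 2: $reads_2"
```

Splitting by line position rather than by searching for "@" matters because quality strings can themselves begin with "@", as several of the quality characters in the reads shown above do.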
SPAdes Assembly
Now let's use SPAdes to assemble the reads. As always, it's a good idea to get a look at what kind of options the program accepts using the -h option. SPAdes is actually written in python and the base script name is "spades.py". There are additional scripts that change many of the default options, such as metaspades.py, plasmidspades.py, and rnaspades.py, or these options can be set from the main spades.py script with the flags --meta, --plasmid, and --rna respectively.
Once you have figured out what options you need to use, see if you can come up with a command to run on the single-end reads and have the output go into a new directory called single_end using all 68 threads that are available (-t 68).
Consider adding a few more commands to show the effect of increasing fragment size, and be sure to give them their own output name:
spades.py -t 68 -o 400_1500_3000 --12 paired_end_2x100_ins_400_c_50.fastq --12 paired_end_2x100_ins_1500_c_20.fastq --12 paired_end_2x100_ins_3000_c_25.fastq
spades.py -t 68 -o 400_and_1500 --12 paired_end_2x100_ins_400_c_50.fastq --12 paired_end_2x100_ins_1500_c_20.fastq
spades.py -t 68 -o 400_only --12 paired_end_2x100_ins_400_c_50.fastq
Put all 4 of the commands into a file named spades_commands. Be sure to ask for help if you are unsure how to use nano to do this.
A warning on memory usage
SPAdes (and most/all other assemblers) usually takes large amounts of RAM to complete. Running these 4 commands on a single node at the same time will likely use more RAM than is available, so it's necessary to run them sequentially or each on its own node. This should also underscore that you should not run this on the head node. If you are assembling large genomes or have high coverage depth data in the future, you will probably need to submit your jobs to the "largemem" queue rather than the "normal" queue and may need to downsample your data.
Submitting the job
Once you have decided on the combinations you want to evaluate, use the 'wc -l' command to verify that your spades_commands file has 4 commands as you expect.
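If you would rather not type the file in nano, a here-document can write it in one step. The sketch below writes to a temporary directory so it can be run anywhere; in practice you would write spades_commands in $SCRATCH/GVA_SPAdes_tutorial. The single-end line is my guess at one reasonable answer to the exercise (-s is the SPAdes option for unpaired reads), so compare it against what you came up with.

```shell
workdir=$(mktemp -d)   # stand-in for $SCRATCH/GVA_SPAdes_tutorial

# One assembly per line; the quoted 'EOF' keeps the shell from altering the text
cat > "$workdir/spades_commands" <<'EOF'
spades.py -t 68 -o single_end -s single_end_100_c_50.fastq
spades.py -t 68 -o 400_only --12 paired_end_2x100_ins_400_c_50.fastq
spades.py -t 68 -o 400_and_1500 --12 paired_end_2x100_ins_400_c_50.fastq --12 paired_end_2x100_ins_1500_c_20.fastq
spades.py -t 68 -o 400_1500_3000 --12 paired_end_2x100_ins_400_c_50.fastq --12 paired_end_2x100_ins_1500_c_20.fastq --12 paired_end_2x100_ins_3000_c_25.fastq
EOF

# wc -l counts newline characters, so this is the number of lines in the file
n_commands=$(( $(wc -l < "$workdir/spades_commands") ))
echo "commands in file: $n_commands"
```

Because wc -l counts lines rather than commands, a stray blank line at the end of the file would be reported as a fifth "command", which is worth ruling out before submitting.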
As we have seen in other tutorials involving the job queue system, we need a slurm file and need to modify it according to what we are actually trying to run.
cp /corral-repl/utexas/BioITeam/gva_course/GVA2022.launcher.slurm spades.slurm
nano spades.slurm
Again while in nano you will edit most of the same lines you edited in the breseq tutorial. Note that most of these lines have additional text to the right of the line. This commented text is present to help remind you what goes on each line; leaving it alone will not hurt anything, while removing it may make it more difficult for you to remember what the purpose of the line is.
Line number | As is | To be |
---|---|---|
16 | #SBATCH -J jobName | #SBATCH -J spades |
17 | #SBATCH -n 1 | #SBATCH -n 4 |
18 | #SBATCH -N 1 | #SBATCH -N 4 |
21 | #SBATCH -t 12:00:00 | #SBATCH -t 4:30:00 |
22 | ##SBATCH --mail-user=ADD | #SBATCH --mail-user=<YourEmailAddress> |
23 | ##SBATCH --mail-type=all | #SBATCH --mail-type=all |
29 | export LAUNCHER_JOB_FILE=commands | export LAUNCHER_JOB_FILE=spades_commands |
The changes to lines 22 and 23 are optional but will give you an idea of what types of email you could expect from TACC if you choose to use these options. Just be sure to pay attention to these 2 lines starting with a single # symbol after editing them.
Again use ctl-o and ctl-x to save the file and exit.
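The same edits can also be scripted with sed, which is handy when you need to prepare several slurm files. The sketch below is not run against the real template: it builds a stand-in file containing only the "As is" lines from the table, and matches lines by their content rather than by line number. The optional mail lines are left untouched because they need your own email address.

```shell
workdir=$(mktemp -d)
# Stand-in for the copied template: only the lines from the table above,
# not the full contents of the real GVA2022.launcher.slurm
cat > "$workdir/template.slurm" <<'EOF'
#SBATCH -J jobName
#SBATCH -n 1
#SBATCH -N 1
#SBATCH -t 12:00:00
##SBATCH --mail-user=ADD
##SBATCH --mail-type=all
export LAUNCHER_JOB_FILE=commands
EOF

# Each expression rewrites only the value, so any trailing comment text
# on the real lines would be preserved
sed -e 's/^#SBATCH -J [^ ]*/#SBATCH -J spades/' \
    -e 's/^#SBATCH -n [0-9][0-9]*/#SBATCH -n 4/' \
    -e 's/^#SBATCH -N [0-9][0-9]*/#SBATCH -N 4/' \
    -e 's/^#SBATCH -t [0-9:]*/#SBATCH -t 4:30:00/' \
    -e 's/^export LAUNCHER_JOB_FILE=[^ ]*/export LAUNCHER_JOB_FILE=spades_commands/' \
    "$workdir/template.slurm" > "$workdir/spades.slurm"

# Show the active (single #) settings after editing
grep -E '^(#SBATCH|export)' "$workdir/spades.slurm"
```

Whichever way you make the changes, it is worth looking over the edited file before submitting it.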
sbatch spades.slurm
Evaluating the output
Explore the output directory that was created for each set of reads you interrogated. The actionable information is in the contigs.fasta file. The contig file is a fasta file ordered by contig length, from longest to shortest. The name of each contig lists the number of the contig (the largest contig being named NODE_1, the next largest NODE_2, and so on), followed by the length of the contig and its coverage (labeled as cov on the line). Generally, assemblies with fewer total contigs and longer individual contigs are regarded as better, but the number of chromosomes present in the organism is an important factor as well.
The grep command can be quite useful for isolating the names of the contigs with the information, especially when combined with the -c option to count the total number of contigs, or piping the results to head/tail or both head and tail to isolate the top/bottom contigs.
# Count the total number of contigs:
grep -c "^>" single_end/contigs.fasta
# Determine the length of the 5 largest contigs:
grep "^>" single_end/contigs.fasta | head -n 5
# Determine the length of the 20 smallest contigs:
grep "^>" single_end/contigs.fasta | tail -n 20
# Determine the length of the 100th through 110th contigs:
grep "^>" single_end/contigs.fasta | head -n 110 | tail -n 11
Since you ran multiple different combinations of reads for the simulated data, how did the insert size affect the number of contigs? The length of the largest contigs? Why might larger insert sizes not help things very much?
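Because the contig names carry the lengths, a few summary numbers can be computed from the header lines alone, which makes comparing the output directories quicker. The sketch below uses four invented headers in the SPAdes naming style and reports the contig count, total assembled length, largest contig, and N50 (the contig length at which the running total, counted from the largest contig down, first reaches half of the total assembly length). Replace the example file with, for instance, single_end/contigs.fasta to use it on real output.

```shell
workdir=$(mktemp -d)
# Invented contig headers in SPAdes naming style; sequences are omitted
# because only the names are parsed here
cat > "$workdir/contigs.fasta" <<'EOF'
>NODE_1_length_400_cov_20.1
>NODE_2_length_300_cov_19.8
>NODE_3_length_200_cov_21.0
>NODE_4_length_100_cov_18.3
EOF

# Field 4 of the "_"-separated name is the contig length. This relies on the
# headers being sorted longest first, as described above.
stats=$(grep "^>" "$workdir/contigs.fasta" | awk -F_ '
    { len[NR] = $4; total += $4 }
    END {
        running = 0
        for (i = 1; i <= NR; i++) {
            running += len[i]
            if (running >= total / 2) { n50 = len[i]; break }
        }
        print "contigs=" NR, "total=" total, "largest=" len[1], "N50=" n50
    }')
echo "$stats"
```

For the four example headers this prints contigs=4 total=1000 largest=400 N50=300. The total is also a useful number to hold against the expected genome size when judging how complete an assembly is.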
The complete E. coli genome is about 4.6 Mb. Why weren't we able to assemble it, even with the "perfect" simulated data?
Visualizing the assembly
As mentioned above the bandage program (https://rrwick.github.io/Bandage/) can be quite helpful in understanding what is going on.
What comes next when working with your own data?
- Look for things: If you're just after a few homologs, an operon, etc. you're probably done. Think about what question you are trying to answer.
- You can turn the contigs.fasta into a blast database (formatdb or makeblastdb depending on which version of blast you have) or try multiple sequence alignments through NCBI's blast.
- If you built your contigs based on a normal/control sample you can map other reads to the contigs using bowtie2 to try to identify variants in other samples.
- If you don't think the contigs you have are "good enough"
- Verify you have trimmed your reads to the best they can be using fastqc, multiqc, and fastp.
- Try using SPAdes MismatchCorrector to see if you can improve the contigs you already have.
- Add additional sequencing libraries to try to connect some more contigs. Especially think about PacBio and Oxford Nanopore sequencing.
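For the blast database suggestion in the list above, a minimal sketch with the newer BLAST+ tools might look like the following. It assumes BLAST+ (makeblastdb and tblastn) is installed in your environment, which this tutorial has not set up, and proteins_of_interest.faa is a hypothetical protein fasta file that you would supply yourself. The database name, output file name, and e-value cutoff are arbitrary choices for the example.

```shell
contigs=single_end/contigs.fasta
proteins=proteins_of_interest.faa   # hypothetical protein fasta file you supply

if command -v makeblastdb >/dev/null 2>&1 && [ -s "$contigs" ] && [ -s "$proteins" ]; then
    # Build a nucleotide database from the assembled contigs
    makeblastdb -in "$contigs" -dbtype nucl -out contigs_db
    # Search the protein queries against the translated contigs;
    # -outfmt 6 gives one tab-separated line per hit
    tblastn -query "$proteins" -db contigs_db -outfmt 6 -evalue 1e-10 > protein_hits.tsv
    blast_status=searched
else
    blast_status=skipped
    echo "BLAST+ or the input files were not found; nothing was run."
fi
```

In the tabular output the second column names the contig that was hit, so the NODE names tell you where in the assembly to look.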