1 Overview
2 Learning Objectives
3 Relationship to other tutorials
4 Identification of a novel plasmid
- 4.1 Get some data
- 4.2 Map read using bowtie2
- 4.3 Assemble unmapped reads
- 4.4 Compare contigs generated from different assemblies
  - 4.4.1 Where does the answer come from?
  - 4.4.2 Where does the answer come from?
  - 4.4.3 Where does the answer come from?
  - 4.4.4 Where does the answer come from?
5 Next steps
- - 5.1.1 Steps to identify phiX
- 5.2 Next steps:

Overview

As has been mentioned several times, variants are anything that is different from a reference genome, but large insertions of novel DNA represent something of an unknown unknown. If a read doesn't map is it because it's some kind of contamination, or is it because something new has entered into the sample. This tutorial is an attempt to use tools you are already familiar with to identify such novel DNA mutations.

Prerequisite required

This tutorial assumes you have already installed spades in the assembly tutorial. Verify you have done the minimal amount from that tutorial by activating your environment (conda activate GVA-Assembly) and check the version with the 'spades.py -v' command returns 3.15.4. If not you will need to complete at least the installation and testing of spades described here Genome Assembly (SPAdes) -- GVA2022#InstallingSPAdes

Learning Objectives

Extract reads where one or both reads do not map to reference
de novo assemble reads

Relationship to other tutorials

As presence of adapter can cause problems with assembly, make sure adapters have been trimmed
Reads will be mapped with bowtie2
non-mapped reads will be extracted, consider checking quality of these reads and possible additional trimming as non-mapped reads may have not mapped due to low quality, and thus could be further improved.
Reads will then be assembled

None of these other tutorials are required to complete this tutorial, but additional information about individual steps may be found there.

Identification of a novel plasmid

One example of novel DNA being present is when a given sample may have a virus or plasmid associated with a sample. Here we will take a sample known to have a high copy plasmid associated with it, but map the reads against only the genome. Unaligned reads would then be expected to be able to assemble into a plasmid.

Get some data

mkdir $SCRATCH/GVA_novel_DNA
cd $SCRATCH/GVA_novel_DNA
cp $BI/gva_course/novel_DNA/* .
ls

^ above transfers 2 gzipped fastq files and a reference file.

Map read using bowtie2

This is the same process used in the read mapping tutorial, and therefore presented with little comment except to remark on differences. Refer to that tutorial for more in-depth information. Recall we installed bowtie2 in our GVA-bowtie2-mapping environment

activate conda envrionment
conda activate GVA-bowtie2-mapping bowtie2 --version
You should see version 2.4.5
Convert reference to fasta
module load bioperl bp_seqconvert.pl --from genbank --to fasta < CP009273.1_Eco_BW25113.gbk > CP009273.1_Eco_BW25113.fasta
Index the reference
mkdir bowtie2 bowtie2-build --threads 68 CP009273.1_Eco_BW25113.fasta bowtie2/CP009273.1_Eco_BW25113
The following command will take ~5 minutes to complete. Before you run the command execute 'bowtie2 -h' so while the command is running you can try to figure out what the different options are doing that we did not include in our first tutorial. Note that in this case we are passing gzipped (compressed) fastq files to bowtie2 rather than uncompressed fastq files. Storing compressed files saves a significant amount of space, and often can be worked with without needing to directly decompress them.
Map reads
bowtie2 --very-sensitive-local -t -p 68 -x bowtie2/CP009273.1_Eco_BW25113 -1 SRR4341249_1.fastq.gz -2 SRR4341249_2.fastq.gz -S bowtie2/SRR4341249-vsl.sam --un-conc SRR4341249-unmapped-vsl.fastq

Assemble unmapped reads

This is the same process used in the plasmid assembly portion of the genome assembly tutorial, and therefore presented with little comment except to remark on differences. Refer to that tutorial for more in-depth information.

Attempt to assemble with plasmidspades.py which expects to find circularized plasmid type sequences. Command expected to take ~15 minutes
run plasmid spades
conda activate GVA-Assembly plasmidspades.py -t 68 -o unmapped_plasmid -1 SRR4341249-unmapped-vsl.1.fastq -2 SRR4341249-unmapped-vsl.2.fastq
Additionally attempt to assemble with base spades.py command which uses different internal settings to potentially identify different types of novel DNA from our unmapped reads. Command expected to take ~15 minutes.
run plasmid spades
spades.py -t 68 -o unmapped_full -1 SRR4341249-unmapped-vsl.1.fastq -2 SRR4341249-unmapped-vsl.2.fastq

Compare contigs generated from different assemblies

Recall from genome assembly tutorial the file we are most interested in is the contigs.fasta file in each output directory (unmapped_plasmid and unmapped_full).

Next steps

Here we have presented a proof of concept that unmapped reads can be used to find something that we actually did know was present. We also found something that was even longer that wasn't expected.

Additional questions are:

Novel DNA Identification -- GVA2022