Novel DNA Identification -- GVA2023
Overview
As has been mentioned several times, variants are anything that is different from a reference genome, but large insertions of novel DNA represent something of an unknown unknown. If a read doesn't map is it because it's some kind of contamination, or is it because something new has entered into the sample. This tutorial is an attempt to use tools you are already familiar with to identify such novel DNA mutations.
Learning Objectives
Extract reads where one or both reads do not map to reference
de novo assemble reads
Overview of process and relationship to other tutorials
The scope of this tutorial
This tutorial focuses on using several different tools which are presented in more detail elsewhere. If you do not understand why a program is being used, how it works, or are otherwise unsure, the tutorials listed below would be good resources, or as always discuss with instructor.
Performing this type of analysis will require the following steps and considerationos:
Adapters should be trimmed as the presence of adapter sequence can negatively impact both mapping, and assembly.
Reads will be mapped with bowtie2 and unmapped reads isolated.
Consider checking quality of the unmapped reads and possibly using more stringent trimming as non-mapped reads may be unmapped mapped due to low quality rather than having a novel source.
Reads will then be assembled.
None of the tutorials linked above are required to complete this tutorial as you will work with new data and go through each of the steps.
Install spades
You may have already done this in the spades tutorial, in which case you can skip the to the verification below for more information about the potentially unfamiliar installation see Genome Assembly (SPAdes) -- GVA2023#InstallingSPAdes
Install
conda create --name GVA-Assembly-Short -c conda-forge -c bioconda spades=3.15.5Test
conda activate GVA-Assembly-Short
mkdir $SCRATCH/GVA_SPAdes_tutorial
cd $SCRATCH/GVA_SPAdes_tutorial
spades.py --testCorrect SPAdes output
======= SPAdes pipeline finished WITH WARNINGS!
=== Error correction and assembling warnings:
* 0:00:01.927 1M / 47M WARN General (launcher.cpp : 178) Your data seems to have high uniform coverage depth. It is strongly recommended to use --isolate option.
======= Warnings saved to $SCRATCH/GVA_SPAdes_tutorial/spades_test/warnings.log
SPAdes log can be found here: $SCRATCH/GVA_SPAdes_tutorial/spades_test/spades.log
Thank you for using SPAdes!Anything else should be considered a failed test, do not continue.
Verify
spades.py --versionExpected version output
SPAdes v3.15.5
Identification of a novel plasmid
One example of novel DNA being present is when a given sample may have a virus or plasmid associated with a sample. Here we will take a sample known to have a high copy plasmid associated with it, but map the reads against only the genome. The unmapped reads are then expected to assemble into a plasmid.
Get some data
Possible errors on idev nodes
As mentioned yesterday, you can not copy from the BioITeam (because it is on corral-repl) while on an idev node. Logout of your idev session, copy the files.
mkdir $SCRATCH/GVA_novel_DNA
cd $SCRATCH/GVA_novel_DNA
cp $BI/gva_course/novel_DNA/* .
ls^ above transfers 2 gzipped fastq files and a reference file.
Map read using bowtie2
This is the same process used in the read mapping tutorial, recall we installed bowtie2 in our GVA-bowtie2-mapping environment.
activate conda envrionment
conda activate GVA-bowtie2-mapping bowtie2 --versionYou should see version 2.5.1
Convert reference to fasta
module load bioperl bp_seqconvert.pl --from genbank --to fasta < CP009273.1_Eco_BW25113.gbk > CP009273.1_Eco_BW25113.fastaIndex the reference
mkdir bowtie2 bowtie2-build --threads 48 CP009273.1_Eco_BW25113.fasta bowtie2/CP009273.1_Eco_BW25113The following command will take ~5 minutes to complete. Before you run the command execute '
bowtie2 -h' so while the command is running you can try to figure out what the different options are doing that we did not include in our first tutorial. Note that in this case we are passing gzipped (compressed) fastq files to bowtie2 rather than uncompressed fastq files. Storing compressed files saves a significant amount of space, and often can be worked with without needing to directly decompress them.Map reads
bowtie2 --very-sensitive-local -t -p 48 -x bowtie2/CP009273.1_Eco_BW25113 -1 SRR4341249_1.fastq.gz -2 SRR4341249_2.fastq.gz -S bowtie2/SRR4341249-vsl.sam --un-conc SRR4341249-unmapped-vsl.fastq
Assemble unmapped reads
This is the same process used in the plasmid assembly portion of the genome assembly tutorial, and therefore presented with little comment except to remark on differences. Refer to that tutorial for more in-depth information.
Attempt to assemble with plasmidspades.py which expects to find circularized plasmid type sequences. Command expected to take ~15 minutes
run plasmid spades
conda activate GVA-Assembly plasmidspades.py -t 48 -o unmapped_plasmid -1 SRR4341249-unmapped-vsl.1.fastq -2 SRR4341249-unmapped-vsl.2.fastqAdditionally attempt to assemble with base spades.py command which uses different internal settings to potentially identify different types of novel DNA from our unmapped reads. Command expected to take ~15 minutes.
run plasmid spades
spades.py -t 48 -o unmapped_full -1 SRR4341249-unmapped-vsl.1.fastq -2 SRR4341249-unmapped-vsl.2.fastq