Novel DNA Identification -- GVA2023

Overview

As has been mentioned several times, variants are anything that is different from a reference genome, but large insertions of novel DNA represent something of an unknown unknown. If a read doesn't map is it because it's some kind of contamination, or is it because something new has entered into the sample. This tutorial is an attempt to use tools you are already familiar with to identify such novel DNA mutations.


Learning Objectives

  1. Extract reads where one or both reads do not map to reference
  2. de novo assemble reads

Overview of process and relationship to other tutorials

The scope of this tutorial

This tutorial focuses on using several different tools which are presented in more detail elsewhere. If you do not understand why a program is being used, how it works, or are otherwise unsure, the tutorials listed below would be good resources, or as always discuss with instructor. 

Performing this type of analysis will require the following steps and considerationos:

  1. Adapters should be trimmed as the presence of adapter sequence can negatively impact both mapping, and assembly.
  2. Reads will be mapped with bowtie2 and unmapped reads isolated.
  3. Consider checking quality of the unmapped reads and possibly using more stringent trimming as non-mapped reads may be unmapped mapped due to low quality rather than having a novel source.
  4. Reads will then be assembled.

None of the tutorials linked above are required to complete this tutorial as you will work with new data and go through each of the steps.

Install spades

You may have already done this in the spades tutorial, in which case you can skip the to the verification below for more information about the potentially unfamiliar installation see Genome Assembly (SPAdes) -- GVA2023#InstallingSPAdes

Install
conda create --name GVA-Assembly-Short -c conda-forge -c bioconda spades=3.15.5
Test
conda activate GVA-Assembly-Short
mkdir $SCRATCH/GVA_SPAdes_tutorial
cd $SCRATCH/GVA_SPAdes_tutorial
spades.py --test
Correct SPAdes output
======= SPAdes pipeline finished WITH WARNINGS!

=== Error correction and assembling warnings:
 * 0:00:01.927     1M / 47M   WARN    General                 (launcher.cpp              : 178)   Your data seems to have high uniform coverage depth. It is strongly recommended to use --isolate option.
======= Warnings saved to $SCRATCH/GVA_SPAdes_tutorial/spades_test/warnings.log

SPAdes log can be found here: $SCRATCH/GVA_SPAdes_tutorial/spades_test/spades.log

Thank you for using SPAdes!

Anything else should be considered a failed test, do not continue.

Verify
spades.py --version
Expected version output
SPAdes v3.15.5


Identification of a novel plasmid

One example of novel DNA being present is when a given sample may have a virus or plasmid associated with a sample. Here we will take a sample known to have a high copy plasmid associated with it, but map the reads against only the genome. The unmapped reads are then expected to assemble into a plasmid.

Get some data

Possible errors on idev nodes

As mentioned yesterday, you can not copy from the BioITeam (because it is on corral-repl) while on an idev node. Logout of your idev session, copy the files.

mkdir $SCRATCH/GVA_novel_DNA
cd $SCRATCH/GVA_novel_DNA
cp $BI/gva_course/novel_DNA/* .
ls

^ above transfers 2 gzipped fastq files and a reference file.



Map read using bowtie2

This is the same process used in the read mapping tutorial, recall we installed bowtie2 in our GVA-bowtie2-mapping environment.

  1. activate conda envrionment
    conda activate GVA-bowtie2-mapping
    bowtie2 --version

    You should see version 2.5.1

  2. Convert reference to fasta
    module load bioperl
    bp_seqconvert.pl --from genbank --to fasta < CP009273.1_Eco_BW25113.gbk > CP009273.1_Eco_BW25113.fasta



  3. Remember to make sure you are on an idev done

    For reasons discussed numerous times throughout the course already, please be sure you are on an idev done. It is unlikely that you are currently on an idev node as copying the files while on an idev node seems to be having problems as discussed. Remember the hostname command and showq -u can be used to check if you are on one of the login nodes or one of the compute nodes. If you need more information or help re-launching a new idev node, please see this tutorial.

    Index the reference
    mkdir bowtie2
    bowtie2-build --threads 48 CP009273.1_Eco_BW25113.fasta bowtie2/CP009273.1_Eco_BW25113



  4. The following command will take ~5 minutes to complete. Before you run the command execute 'bowtie2 -h' so while the command is running you can try to figure out what the different options are doing that we did not include in our first tutorial. Note that in this case we are passing gzipped (compressed) fastq files to bowtie2 rather than uncompressed fastq files. Storing compressed files saves a significant amount of space, and often can be worked with without needing to directly decompress them.

    Map reads
    bowtie2 --very-sensitive-local -t -p 48 -x bowtie2/CP009273.1_Eco_BW25113 -1 SRR4341249_1.fastq.gz -2 SRR4341249_2.fastq.gz -S bowtie2/SRR4341249-vsl.sam --un-conc SRR4341249-unmapped-vsl.fastq
     Click here to explain the new options.
    optioneffect
    --very-sensitive-local
    map in very sensitive local alignment mode
    -t
    print time info
    -p 48
    use 48 processors
    -x bowtie2/CP009273.1_Eco_BW25113
    index
    -1 SRR4341249_1.fastq.gz 
    -2 SRR4341249_2.fastq.gz
    reads used for mapping
    -S bowtie2/SRR4341249-vsl.sam
    Sam file detailing mapping
    --un-conc SRR4341249-unmapped-vsl.fastq
    print reads which do not map to the genome to the file
    SRR4341249-unmapped-vsl.fastq

    Additional information about local and global alignment

    Additional general information and examples of how these modes are different can be found here https://www.majordifferences.com/2016/05/difference-between-global-and-local.html

    Information specific to bowtie2's handling of these modes can be found here https://bowtie-bio.sourceforge.net/bowtie2/manual.shtml#end-to-end-alignment-versus-local-alignment

     What percent of reads mapped?

    The stdoutput of the program listed:

    66.62% overall alignment rate

     How many reads are in our unmapped fastq file?

    grep -c "^+SRR4341249" SRR4341249-unmapped-vsl.1.fastq 

    returns: 441835

    Note that head or tail of the fastq file shows that our 3rd line is not just "+", hence why we have to add the additional sequence ID to the grep command



Assemble unmapped reads

This is the same process used in the plasmid assembly portion of the genome assembly tutorial, and therefore presented with little comment except to remark on differences. Refer to that tutorial for more in-depth information.

  1. Attempt to assemble with plasmidspades.py which expects to find circularized plasmid type sequences. Command expected to take ~15 minutes

    run plasmid spades
    conda activate GVA-Assembly
    plasmidspades.py -t 48 -o unmapped_plasmid -1 SRR4341249-unmapped-vsl.1.fastq -2 SRR4341249-unmapped-vsl.2.fastq
  2. Additionally attempt to assemble with base spades.py command which uses different internal settings to potentially identify different types of novel DNA from our unmapped reads. Command expected to take ~15 minutes. 

    run plasmid spades
    spades.py -t 48 -o unmapped_full -1 SRR4341249-unmapped-vsl.1.fastq -2 SRR4341249-unmapped-vsl.2.fastq

Compare contigs generated from different assemblies

Recall from genome assembly tutorial the file we are most interested in is the contigs.fasta file in each output directory (unmapped_plasmid and unmapped_full).

  1.  How many contigs were generated in each case?

    2 for the plasmid and 16 for the full

    Where does the answer come from?
    grep -c "^>" unmapped*/contigs.fasta
  2.  Are any of the contigs the same?

    Probably, both contigs detected in the plasmid mode have similar lengths in the full mode

    Where does the answer come from?
    grep "^>" unmapped*/contigs.fasta
    # lists identical lengths and coverages for 2 plasmid contig
  3.  What sizes are the contigs?

    5441 or 5463

    3170 or 3192

    14 others less than 500bp in full spade mode

    Where does the answer come from?
    grep "^>" unmapped*/contigs.fasta
    # same command as above, just focusing on the length value
  4.  which is most likely to be our plasmid?

    The contig that is 3170 or 3192

    Where does the answer come from?
    # From above, I stated this was a high copy plasmid, it has a coverage value of 12,698 compared to 98 for the larger contig
  5.  Is that actually our plasmid?

    Yes! The actual plasmid reference locus line stats:

    LOCUS       GFP_Plasmid_Sequ        3115 bp    DNA     circular UNA 18-NOV-2013

     Why might the sizes not agree?

    My thought is that this was raw fastq files that were fed into the assembly, not trimmed files. Leads me to hypothesize that the difference in size is related to the presence of adapter sequence. Alternatively, it may be that small changes in either bowtie2 or spades versions or read trimmers have influenced what reads are considered. In fact using bowtie 2.3.5.1 and spades 3.13.0 both mappers gave the same stats on these 2 conditions for the largest 2 plasmids (3170 and 5441)


Next steps

Here we have presented a proof of concept that unmapped reads can be used to find something that we actually did know was present. We also found something that was even longer that wasn't expected.

Additional questions are:

  1.  How might we go about finding out what an assembled product actually is when it truly is novel rather than a positive control?

    blast. In fact I did just that and identified it as an artifact of sequencing. The contig corresponds to phiX.

    Steps to identify phiX
    copy full 5441 bp of sequence
    Go to https://blast.ncbi.nlm.nih.gov/Blast.cgi
    large list of results, including vast majority listing phiX or genome assembly/scaffold

    Why does seeing phiX (link to screen shot of blast results) tell me that it is an artifact? phiX is used as a loading control for illumina runs to both tell the difference between a failed run because of bad libraries and a failed run due to poor base diversity.

  2.  How would we decide if it was real or important if we hadn't recognized it?
    Depends on blast results, how high coverage is compared to genome, gene content
  3.  In other systems what else might you find?
    viruses, mobile genetic elements, evidence of microbiome, mitochondria, chloroplast, other plasmids
  4.  How does this effect mapping?
  5.  Do you expect to find more novel DNA in a highly accurate reference file, or a "similar' reference file?
    Similar. The fact that the reference is not as accurate will lower the alignment scores across the board, potentially dropping below thresholds to be able to anchor the match at all. Look deeper at the bowtie2 mapping command where you used --very-sensative-local mode the documentation tells you about tolerated mismatches etc. The more reads you have that don't match, the more novel DNA inserts you are likely to deal with.

Next steps:

 You have likely worked through enough of these turtorials to notice a glaring error in the way this analysis was done. 

The reads were not trimmed. As a bonus exercise you could trim these reads before redoing the analysis and see how it effects alignment fraction, and assembly statistics.


Return to GVA2023 course page.