Atlassian uses cookies to improve your browsing experience, perform analytics and research, and conduct advertising. Accept all cookies to indicate that you agree to our use of cookies on your device. Atlassian cookies and tracking notice, (opens new window)
/
Novel DNA Identification -- GVA2021

    Novel DNA Identification -- GVA2021

    Jun 16, 2021

    Overview

    As has been mentioned several times, variants are anything that is different from a reference genome, but large insertions of novel DNA represent something of an unknown unknown. If a read doesn't map is it because its some kind of contamination, or is it because something new has entered into the sample. This tutorial is an attempt to use tools you are already familiar with to identify such novel DNA mutations.

    Prerequisite required

    This tutorial assumes you have already installed spades in the assembly tutorial. Verify you have done so with the 'spades.py -v' command does not return 3.13.0, need to at least complete enough of that tutorial that spades is installed. Genome Assembly (SPAdes) -- GVA2021#InstallingSPAdes


    Learning Objectives

    1. Extract reads where one or both reads do not map to reference
    2. de novo assemble reads

    Relationship to other tutorials

    1. As presence of adapter can cause problems with assembly, make sure adapters have been trimmed
    2. Reads will be mapped with bowtie2
    3. non-mapped reads will be extracted, consider checking quality of these reads and possible additional trimming as non-mapped reads may have not mapped due to low quality, and thus could be further improved.
    4. Reads will then be assembled

    None of these other tutorials are required to complete this tutorial, but additional information about individual steps may be found there.

    Identification of a novel plasmid

    One example of novel DNA being present is when a given sample may have a virus or plasmid associated with a sample. Here we will take a sample known to have a high copy plasmid associated with it, but map the reads against only the genome. Unaligned reads would then be expected to be able to assemble into a plasmid.

    Get some data

    mkdir $SCRATCH/GVA_novel_DNA
    cd $SCRATCH/GVA_novel_DNA
    cp $BI/gva_course/novel_DNA/* .
    ls

    ^ above transfers 2 fastq files and a reference file.



    Map read using bowtie2

    This is the same process used in the read mapping tutorial, and therefore presented with little comment except to remark on differences. Refer to that tutorial for more in-depth information. Recall we installed bowtie2 in our GVA2021 environment

    1. activate conda envrionment
      conda activate GVA2021
      bowtie2 --version

      You should see version 2.3.5.1

    2. Convert reference to fasta
      module load bioperl
      bp_seqconvert.pl --from genbank --to fasta < CP009273.1_Eco_BW25113.gbk > CP009273.1_Eco_BW25113.fasta



    3. Remember to make sure you are on an idev done

      For reasons discussed numerous times throughout the course already, please be sure you are on an idev done. It is unlikely that you are currently on an idev node as copying the files while on an idev node seems to be having problems as discussed. Remember the hostname command and showq -u can be used to check if you are on one of the login nodes or one of the compute nodes. If you need more information or help re-launching a new idev node, please see this tutorial.

      Index the reference
      mkdir bowtie2
      bowtie2-build CP009273.1_Eco_BW25113.fasta bowtie2/CP009273.1_Eco_BW25113



    4. The following command will take ~7 minutes to complete. Before you run the command execute 'bowtie2 -h' so while the command is running you can try to figure out what the different options are doing that we did not include in our first tutorial.

      Map reads
      bowtie2 --very-sensitive-local -t -p 68 -x bowtie2/CP009273.1_Eco_BW25113 -1 SRR4341249_1.fastq -2 SRR4341249_2.fastq -S bowtie2/SRR4341249-vsl.sam --un-conc SRR4341249-unmapped-vsl.fastq
       Click here to explain the new options.
      optioneffect
      --very-sensitive-local
      map in very sensitive local alignment mode
      -t
      print time info
      -p 68
      use 68 processors
      -x bowtie2/CP009273.1_Eco_BW25113
      index
      -1 SRR4341249_1.fastq 
      -2 SRR4341249_2.fastq
      reads used for mapping
      -S bowtie2/SRR4341249-vsl.sam
      Sam file detailing mapping
      --un-conc SRR4341249-unmapped-vsl.fastq
      print reads which do not map to the genome to the file
      SRR4341249-unmapped-vsl.fastq
       What percent of reads mapped?

      The stdoutput of the program listed:

      65.74% overall alignment rate

       How many reads are in our unmapped fastq file?

      grep -c "^+SRR4341249"  SRR4341249-unmapped-vsl.1.fastq

      returns: 441842

      Note that head or tail of the fastq file shows that our 3rd line is not just "+", hence why we have to add the additional sequence ID to the grep command



    Assemble unmapped reads

    This is the same process used in the plasmid assembly portion of the genome assembly tutorial, and therefore presented with little comment except to remark on differences. Refer to that tutorial for more in-depth information.

    1. Attempt to assemble with plasmidspades.py which expects to find circularized plasmid type sequences. Command expected to take ~15 minutes

      run plasmid spades
      conda activate GVA-SPAdes
      plasmidspades.py -t 68 -o unmapped_plasmid -1 SRR4341249-unmapped-vsl.1.fastq -2 SRR4341249-unmapped-vsl.2.fastq
    2. Additionally attempt to assemble with base spades.py command which uses different internal settings to potentially identify different types of novel DNA from our unmapped reads. Command expected to take ~15 minutes. 

      run plasmid spades
      spades.py -t 68 -o unmapped_full -1 SRR4341249-unmapped-vsl.1.fastq -2 SRR4341249-unmapped-vsl.2.fastq

    Compare contigs generated from different assemblies

    Recall from genome assembly tutorial the file we are most interested in is the contigs.fasta file in each output directory (unmapped_plasmid and unmapped_full).

    1.  How many contigs were generated in each case?

      2 for the plasmid and 16 for the full

      Where does the answer come from? Expand source
      grep -c "^>" unmapped*/contigs.fasta
    2.  Are any of the contigs the same?

      Yes, both contigs detected in the plasmid mode were also detected in the full mode

      Where does the answer come from? Expand source
      grep "^>" unmapped*/contigs.fasta
      # lists identical lengths and coverages for 2 plasmid contig
    3.  What sizes are the contigs?

      5441

      3170

      14 others less than 500bp

      Where does the answer come from? Expand source
      grep "^>" unmapped*/contigs.fasta
      # same command as above, just focusing on the length value
    4.  which is most likely to be our plasmid?

      The contig that is 3170.

      Where does the answer come from? Expand source
      # From above, I stated this was a high copy plasmid, it has a coverage value of 12,698 compared to 98 for the larger contig
    5.  Is that actually our plasmid?

      Yes! The actual plasmid reference locus line stats:

      LOCUS       GFP_Plasmid_Sequ        3115 bp    DNA     circular UNA 18-NOV-2013

       Why might the sizes not agree?

      My thought is that this was raw fastq files that were fed into the assembly, not trimmed files. Leads me to hypothesize that the difference in size is related to the presence of adapter sequence.


    Next steps

    Here we have presented a proof of concept that unmapped reads can be used to find something that we actually did know was present. We also found something that was even longer that wasn't expected.

    Additional questions are:

    1.  How might we go about finding out what an assembled product actually is when it truly is novel rather than a positive control?

      blast. In fact I did just that and identified it as an artifact of sequencing. The contig corresponds to phiX.

      Steps to identify phiX
      copy full 5441 bp of sequence
      Go to https://blast.ncbi.nlm.nih.gov/Blast.cgi
      large list of results, including vast majority listing phiX or genome assembly/scaffold

      Why does seeing phiX (link to screen shot of blast results) tell me that it is an artifact? phiX is used as a loading control for illumina runs to both tell the difference between a failed run because of bad libraries and a failed run due to poor base diversity.

    2.  How would we decide if it was real or important if we hadn't recognized it?
      Depends on blast results, how high coverage is compared to genome, gene content
    3.  In other systems what else might you find?
      viruses, mobile genetic elements, evidence of microbiome, mitochondria, chloroplast
    4.  How does this effect mapping?
      Consider advanced read mapping with multiple references tutorial
    5.  Do you expect to find more novel DNA in a highly accurate reference file, or a "similar' reference file?
      Similar. The fact that the reference is not as accurate will lower the alignment scores across the board, potentially dropping below thresholds to be able to anchor the match at all. Look deeper at the bowtie2 mapping command where you used --very-sensative-local mode the documentation tells you about tolerated mismatches etc. The more reads you have that don't match, the more novel DNA inserts you are likely to deal with.


    Return to GVA2021

    , multiple selections available,

    Confluence Documentation | Web Privacy Policy | Web Accessibility

    University Wiki Service

    Bioinformatics Team (BioITeam) at the University of Texas
    • File lists
      File lists
       This trigger is hidden
    • How-to articles
      How-to articles
       This trigger is hidden
    Results will update as you type.
    • Bioinformatics Services
    • CBRS Mini Symposium
    • BCG Full Service Pipelines
    • zArchive
      • TACC stuff
      • Old Home page
      • Old BCG Full Service Pipelines
      • Training
        • Bioinformatics Courses and Content
          • Genome Variant Analysis Course 2014
          • Genome Variant Analysis Course 2015
          • Genome Variant Analysis Course 2016
          • Genome Variant Analysis Course 2017
          • Genome Variant Analysis Course 2019
          • Genome Variant Analysis Course 2020
          • Genome Variant Analysis Course 2021
            • Accessing stampede2 via ssh on MacOs
            • Accessing stampede2 via ssh on Windows10
            • Advanced bowtie2 -- GVA2021
            • Advanced breseq GVA2021
            • Analyzing Annovar Annotation Output -- GVA2021
            • Annovar Annotations -- GVA2021
            • Bedtools tutorial -- GVA2021
            • Breseq with multiple references GVA2021
            • Copying files from TACC to your personal computer.
            • Day 1 Environment Checkup and Catchup
            • Evaluating and processing raw sequencing data GVA2021
            • Exome Capture Metrics GVA2021
            • Genome Analysis Toolkit (GATK) . -- GVA2021
            • Genome Assembly (SPAdes) -- GVA2021
            • GVA2021 - Class Review
            • Human Trios -- GVA2021
            • IGV Tutorial -- GVA2021
            • launcher_creator GVA2021 -- unfinished
            • Launching an iDev session GVA2021
            • Linux and stampede2 Setup -- GVA2021
            • Molecular Index Error correction GVA2021
            • MultiQC - fastQC summary tool -- GVA2021
            • Novel DNA Identification -- GVA2021
            • Read Mapping with bowtie2 Tutorial GVA2021
            • Running Annovar 2021
            • Single Nucleotide Variant (SNV) calling Tutorial GVA2021
            • SSCS vs Trimmed Read Variant calls GVA2021
            • Stampede2 Breseq Tutorial GVA2021
            • Structural Variant (SV) calling with SVdetect 2021
            • Trimmomatic - GVA2021
          • Genome Variant Analysis Course 2022
          • Introduction to NGS Short Course 2015
          • Introduction to RAD-seq 2018
          • Introduction to RAD-seq short course 2017
          • Introduction to RNA Seq Course 2014
          • Introduction to RNA Seq Short Course Commands
          • Introduction To TACC Short Course Cheats
          • NGS RNA-seq exploration short course
          • Introduction to TACC (Video Tutorials):
          • Genome Variant Analysis Course 2023
          • Introduction to RNA Seq Course
        • SSC Intro to NGS Bioinformatics Course
        • BME 383J Course Content
        • Obsolete NGS course materials
        • Relevant classes offered at UT System Schools
        • Short Training Topics
      • Software
      • How to join the BioITeam
      • File lists
      • Bioinformatic Jobs at UT
      • Wish list
      • How-to articles
      • Appsoma-based pipeline development
    • Byte Club
    • Introduction to Biocomputing: From files to functions to plots
    • R/RStudio Part I
      Calendars

    You‘re viewing this with anonymous access, so some content might be blocked.
    {"serverDuration": 12, "requestCorrelationId": "48b42319bb33428eb1666dfb24637209"}