Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

One example of novel DNA being present is when a given sample may have a virus or plasmid associated with a sample. Here we will take a sample known to have a high copy plasmid associated with it, but map the reads against only the genome. The unmapped reads are then expected to assemble into a plasmid.

Get some data

Note
titlePossible errors on idev nodes

As mentioned yesterday, you can not copy from the BioITeam (because it is on corral-repl) while on an idev node. Logout of your idev session, copy the files.


Code Block
languagebash
mkdir $SCRATCH/GVA_novel_DNA
cd $SCRATCH/GVA_novel_DNA
cp $BI/gva_course/novel_DNA/* .
ls

...


  1. Code Block
    languagebash
    titleactivate conda envrionment
    conda activate GVA-bowtie2-mapping
    bowtie2 --version

    You should see version 2.5.1


  2. Code Block
    languagebash
    titleConvert reference to fasta
    module load bioperl
    bp_seqconvert.pl --from genbank --to fasta < CP009273.1_Eco_BW25113.gbk > CP009273.1_Eco_BW25113.fasta




  3. Warning
    titleRemember to make sure you are on an idev done

    For reasons discussed numerous times throughout the course already, please be sure you are on an idev done. It is unlikely that you are currently on an idev node as copying the files while on an idev node seems to be having problems as discussed. Remember the hostname command and showq -u can be used to check if you are on one of the login nodes or one of the compute nodes. If you need more information or help re-launching a new idev node, please see this tutorial.


    Code Block
    languagebash
    titleIndex the reference
    mkdir bowtie2
    bowtie2-build --threads 6848 CP009273.1_Eco_BW25113.fasta bowtie2/CP009273.1_Eco_BW25113



  4. The following command will take ~5 minutes to complete. Before you run the command execute 'bowtie2 -h' so while the command is running you can try to figure out what the different options are doing that we did not include in our first tutorial. Note that in this case we are passing gzipped (compressed) fastq files to bowtie2 rather than uncompressed fastq files. Storing compressed files saves a significant amount of space, and often can be worked with without needing to directly decompress them.

    Code Block
    languagebash
    titleMap reads
    bowtie2 --very-sensitive-local -t -p 6848 -x bowtie2/CP009273.1_Eco_BW25113 -1 SRR4341249_1.fastq.gz -2 SRR4341249_2.fastq.gz -S bowtie2/SRR4341249-vsl.sam --un-conc SRR4341249-unmapped-vsl.fastq


    Expand
    titleClick here to explain the new options.


    optioneffect
    --very-sensitive-local
    map in very sensitive local alignment mode
    -t
    print time info
    -p 6848
    use 68 48 processors
    -x bowtie2/CP009273.1_Eco_BW25113
    index
    -1 SRR4341249_1.fastq.gz 
    -2 SRR4341249_2.fastq.gz
    reads used for mapping
    -S bowtie2/SRR4341249-vsl.sam
    Sam file detailing mapping
    --un-conc SRR4341249-unmapped-vsl.fastq
    print reads which do not map to the genome to the file
    SRR4341249-unmapped-vsl.fastq



    Info
    titleAdditional information about local and global alignment

    Additional general information and examples of how these modes are different can be found here https://www.majordifferences.com/2016/05/difference-between-global-and-local.html

    Information specific to bowtie2's handling of these modes can be found here https://bowtie-bio.sourceforge.net/bowtie2/manual.shtml#end-to-end-alignment-versus-local-alignment


    Expand
    titleWhat percent of reads mapped?

    The stdoutput of the program listed:

    66.62% overall alignment rate


    Expand
    titleHow many reads are in our unmapped fastq file?

    grep -c "^+SRR4341249" SRR4341249-unmapped-vsl.1.fastq 

    returns: 441835

    Note that head or tail of the fastq file shows that our 3rd line is not just "+", hence why we have to add the additional sequence ID to the grep command



...

  1. Attempt to assemble with plasmidspades.py which expects to find circularized plasmid type sequences. Command expected to take ~15 minutes

    Code Block
    languagebash
    titlerun plasmid spades
    conda activate GVA-Assembly
    plasmidspades.py -t 6848 -o unmapped_plasmid -1 SRR4341249-unmapped-vsl.1.fastq -2 SRR4341249-unmapped-vsl.2.fastq


  2. Additionally attempt to assemble with base spades.py command which uses different internal settings to potentially identify different types of novel DNA from our unmapped reads. Command expected to take ~15 minutes. 

    Code Block
    languagebash
    titlerun plasmid spades
    spades.py -t 6848 -o unmapped_full -1 SRR4341249-unmapped-vsl.1.fastq -2 SRR4341249-unmapped-vsl.2.fastq


...

Expand
titleYou have likely worked through enough of these turtorials to notice a glaring error in the way this analysis was done. 

The reads were not trimmed. As a bonus exercise you could trim these reads before redoing the analysis and see how it effects alignment fraction, and assembly statistics.


Return to GVA2022GVA2023 course page.