Table of Contents |
---|
Overview
As has been mentioned several times, variants are anything that is different from a reference genome, but large insertions of novel DNA represent something of an unknown unknown. If a read doesn't map is it because its some kind of contamination, or is it because something new has entered into the sample. This tutorial is an attempt to use tools you are already familiar with to identify such novel DNA mutations. As this is a new tutorial any feedback is very welcome and helpful.
Note | ||
---|---|---|
| ||
This tutorial assumes you have already installed spades in the assembly tutorial. Verify you have done so with the 'spades.py -v' command does not return 3.13.0, need to at least complete enough of that tutorial that spades is installed. |
Learning Objectives
- Extract reads where one or both reads do not map to reference
- de novo assemble reads
Relationship to other tutorials
- As presence of adapter can cause problems with assembly, make sure adapters have been trimmed
- Reads will be mapped with bowtie2
- non-mapped reads will be extracted, consider checking quality of these reads and possible additional trimming
- Reads will then be assembled
None of these other tutorials are required to complete this tutorial, but additional information about individual steps may be found there.
Identification of a novel plasmid
One example of novel DNA being present is when a given sample may have a virus or plasmid associated with a sample. Here we will take a sample known to have a high copy plasmid associated with it, but map the reads against only the genome. Unaligned reads would then be expected to be able to assemble into a plasmid.
Get some data
Code Block | ||
---|---|---|
| ||
mkdir $SCRATCH/GVA_novel_DNA
cd $SCRATCH/GVA_novel_DNA
cp $BI/gva_course/novel_DNA/* .
ls |
^ above transfers 2 fastq files and a reference file.
Map read using bowtie2
This is the same process used in the read mapping tutorial, and therefore presented with little comment except to remark on differences. Refer to that tutorial for more in-depth information.
Code Block language bash title Convert reference to fasta module load bioperl bp_seqconvert.pl --from genbank --to fasta < CP009273.1_Eco_BW25113.gbk > CP009273.1_Eco_BW25113.fasta
Warning title Remember to make sure you are on an idev done For reasons discussed numerous times throughout the course already, please be sure you are on an idev done. It is unlikely that you are currently on an idev node as copying the files while on an idev node seems to be having problems as discussed. Remember the hostname command and showq -u can be used to check if you are on one of the login nodes or one of the compute nodes. If you need more information or help re-launching a new idev node, please see this tutorial.
Code Block language bash title Index the reference mkdir bowtie2 module load bowtie/2.3.4 bowtie2-build CP009273.1_Eco_BW25113.fasta bowtie2/CP009273.1_Eco_BW25113
The following command will take ~5 minutes to complete. Before you run the command execute '
bowtie2 -h
' so while the command is running you can try to figure out what the different options are doing that we did not include in our first tutorial.Code Block language bash title Map reads bowtie2 --very-sensitive-local -t -p 48 -x bowtie2/CP009273.1_Eco_BW25113 -1 SRR4341249_1.fastq -2 SRR4341249_2.fastq -S bowtie2/SRR4341249-vsl.sam --un-conc SRR4341249-unmapped-vsl.fastq
Expand title Click here to explain the new options. option effect --very-sensitive-local
map in very sensitive local alignment mode -t
print time info -p 48
use 48 processors -x bowtie2/CP009273.1_Eco_BW25113
index -1 SRR4341249_1.fastq
-2 SRR4341249_2.fastqreads used for mapping -S bowtie2/SRR4341249-vsl.sam
Sam file detailing mapping --un-conc SRR4341249-unmapped-vsl.fastq
print reads which do not map to the genome to the file SRR4341249-unmapped-vsl.fastq
Expand title What percent of reads mapped? The stdoutput of the program listed:
65.74% overall alignment rate
Assemble unmapped reads
This is the same process used in the plasmid assembly portion of the genome assembly tutorial, and therefore presented with little comment except to remark on differences. Refer to that tutorial for more in-depth information.
Attempt to assemble with plasmidspades.py which expects to find circularized plasmid type sequences. Command expected to take ~5 minutes
Code Block language bash title run plasmid spades plasmidspades.py -t 48 -o unmapped_plasmid -1 SRR4341249-unmapped-vsl.1.fastq -2 SRR4341249-unmapped-vsl.2.fastq
Additionally attempt to assemble with base spades.py command which uses different internal settings to potentially identify different types of novel DNA from our unmapped reads. Command expected to take ~5 minutes.
Code Block language bash title run plasmid spades spades.py -t 48 -o unmapped_full -1 SRR4341249-unmapped-vsl.1.fastq -2 SRR4341249-unmapped-vsl.2.fastq
Compare contigs generated from different assemblies
Recall from genome assembly tutorial the file we are most interested in is the contigs.fasta file in each output directory (unmapped_plasmid and unmapped_full).
Expand title How many contigs were generated in each case? 2 for the plasmid and 16 for the full
Code Block language bash title Where does the answer come from? collapse true grep -c "^>" unmapped*/contigs.fasta
Expand title Are any of the contigs the same? Yes, both contigs detected in the plasmid mode were also detected in the full mode
Code Block language bash title Where does the answer come from? collapse true grep "^>" unmapped*/contigs.fasta # lists identical lengths and coverages for 2 plasmid contig
Expand title What sizes are the contigs? 5441
3170
14 others less than 500bp
Code Block language bash title Where does the answer come from? collapse true grep "^>" unmapped*/contigs.fasta # same command as above, just focusing on the length value
Expand title which is most likely to be our plasmid? The contig that is 3170.
Code Block language bash title Where does the answer come from? collapse true # From above, I stated this was a high copy plasmid, it has a coverage value of 12,698 compared to 98 for the larger contig
Expand title Is that actually our plasmid? Yes! The actual plasmid locus line stats:
LOCUS GFP_Plasmid_Sequ 3115 bp DNA circular UNA 18-NOV-2013
Expand title Why might the sizes not agree? My thought is that this was raw fastq files that were fed into the assembly, not trimmed files. Leads me to hypothesize that the difference in size is related to the presence of adapter sequence.
Next steps
Here we have presented a proof of concept that unmapped reads can be used to find something that we actually did know was present. We also found something that was even longer that wasn't expected.
Additional questions are:
- How might we go about finding out what an assembled product actually is when it truly is novel rather than a positive control? ... blast
- How would we decide if it was real or important? ... depends on blast results, how high coverage is compared to genome, gene content
- In other systems what else might you find? ... viruses, mobile genetic elements, evidence of microbiome
- How does this effect mapping? ... consider tomorrow's read mapping with multiple references tutorial
- Do you expect to find more novel DNA in a highly accurate reference file, or a "similar' reference file? ... similar