Advanced bowtie2 -- GVA2021
Overview
Throughout the course we have focused, with a few exceptions, on aligning a single sample of reads against a single reference genome, but sometimes we know that this does not reflect the biology. Sometimes we know that multiple different things are present in a single sample. The most common situation is a sample with a chromosome as well as a plasmid. Here we will examine the same sample used in the novel DNA identification tutorial to see how including a second reference file changes the mapping results.
The discussion of concepts in this tutorial is identical to that in the breseq with multiple refs tutorial, and both tutorials work with the same data. The discussion of results differs between the two tutorials.
Learning objectives
- Understand how mapping against multiple reference genomes simultaneously decreases noise and increases confidence in variant calls.
- Map reads against multiple references, and evaluate mapping success/frequency.
Mapping against multiple references
Why is mapping against multiple references at the same time preferred to mapping against multiple different references one at a time? The answer relates to distinguishing real mutations from errors. As we discussed in our initial mapping tutorial/presentation, mapping scores and alignment scores relate, respectively, to how confident the mapping program is that an individual read is mapped to the correct location in the genome, and to how well that read aligns at that location. Imagine a hypothetical situation in which a 200bp region of a low copy plasmid differs from the chromosome by a single base.
- If you map against both references at the same time, the mapper will assign each read uniquely to either the plasmid region or the chromosome, without any mismatches.
- If you map against the references separately, each run will align the reads from both copies to this region with high quality mapping scores, and roughly 50% of those reads will carry a mismatch and a slightly diminished alignment score. In the case of clonal haploids, the 50% frequency would be concerning, as 50% shouldn't occur under normal circumstances, but the duplication of a region followed by a single base change in one of the copies would produce an identical result.
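Once reads have been mapped against both references at once, one way to evaluate how they were divided is to tally the reference name stored in column 3 of the SAM output. What follows is a minimal sketch, not output from this tutorial's data: it assumes a tiny made-up SAM file, toy_both_refs.sam, with the hypothetical reference names chromosome and plasmid standing in for real bowtie2 output.

```shell
# Toy SAM file (made-up reads and reference names) standing in for real bowtie2 output.
# Columns: QNAME FLAG RNAME POS MAPQ CIGAR RNEXT PNEXT TLEN SEQ QUAL
printf '@SQ\tSN:chromosome\tLN:1000\n'  > toy_both_refs.sam
printf '@SQ\tSN:plasmid\tLN:200\n'     >> toy_both_refs.sam
printf 'read1\t0\tchromosome\t10\t42\t8M\t*\t0\t0\tACGTACGT\tIIIIIIII\n'  >> toy_both_refs.sam
printf 'read2\t0\tplasmid\t10\t42\t8M\t*\t0\t0\tACGTACGA\tIIIIIIII\n'     >> toy_both_refs.sam
printf 'read3\t16\tchromosome\t12\t42\t8M\t*\t0\t0\tGTACGTAC\tIIIIIIII\n' >> toy_both_refs.sam
printf 'read4\t4\t*\t0\t0\t*\t*\t0\t0\tTTTTTTTT\tIIIIIIII\n'              >> toy_both_refs.sam

# Skip header lines (they start with @), then count reads per reference name (column 3).
# Unmapped reads carry "*" in column 3. awk's loop order is unspecified, so sort the result.
per_ref_counts=$(awk -F'\t' '!/^@/ { n[$3]++ } END { for (r in n) print r, n[r] }' toy_both_refs.sam | LC_ALL=C sort)
echo "$per_ref_counts"
```

For the toy file this reports two reads on chromosome, one on plasmid, and one unmapped read (shown as `*`). On real data, replace toy_both_refs.sam with the SAM file produced by your own mapping run.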
Get some data
Here we will use the same data as was used in the novel DNA identification tutorial plus an additional reference file associated with the plasmid known to be present.
mkdir $SCRATCH/GVA_advanced_mapping
cp $BI/gva_course/novel_DNA/* $SCRATCH/GVA_advanced_mapping
cp $BI/gva_course/advanced_mapping/* $SCRATCH/GVA_advanced_mapping
cd $SCRATCH/GVA_advanced_mapping
ls
^ expected 2 fastq files and 2 gbk reference files
Set up the run
Hopefully by now it will not surprise you that we will have a 3 step process: converting references to fasta, indexing the references, and mapping the reads. If it does, recall the read mapping tutorial and the novel DNA identification tutorial for additional information and examples. Here, less description of individual steps will be given.
- Convert reference to fasta
module load bioperl
bp_seqconvert.pl --from genbank --to fasta < CP009273.1_Eco_BW25113.gbk > CP009273.1_Eco_BW25113.fasta
bp_seqconvert.pl --from genbank --to fasta < GFP_Plasmid_SKO4.gbk > GFP_Plasmid_SKO4.fasta
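A quick sanity check after conversion is to count the header lines (those beginning with `>`) in each fasta file, since each header marks one sequence entry. The sketch below uses two tiny made-up files, toy_chromosome.fasta and toy_plasmid.fasta, so that it runs anywhere. On the real data, point grep at the two fasta files written by bp_seqconvert.pl instead.

```shell
# Two toy fasta files (made-up names and sequences) standing in for the converted references.
printf '>toy_chromosome\nACGTACGTACGT\nACGTACGT\n' > toy_chromosome.fasta
printf '>toy_plasmid\nTTGACCA\n' > toy_plasmid.fasta

# Each header line starts with ">", so counting header lines counts sequence entries.
chromosome_entries=$(grep -c '^>' toy_chromosome.fasta)
plasmid_entries=$(grep -c '^>' toy_plasmid.fasta)
echo "toy_chromosome.fasta: $chromosome_entries"
echo "toy_plasmid.fasta: $plasmid_entries"

# Listing the header lines shows the sequence names stored in each file.
grep '^>' toy_chromosome.fasta toy_plasmid.fasta
```

A count of 0 would suggest the conversion produced an empty or malformed file and is worth investigating before indexing.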
Recall that fasta files can have multiple sequence entries
And thus we could combine our new fasta files using the cat command and output redirection to create a single file containing both reference sequences. In that case, the same procedure as was used in the original read mapping tutorial could be followed and would produce the same results as obtained here. However, that would make it less obvious how bowtie2 handles multiple reference sequences and would partly defeat the purpose of this tutorial. I think it better practice to keep the reference files separ