Overview

Throughout the course we have focused on aligning a single sample of reads against a single reference genome with a few exceptions, but sometimes we know that is not the biological case. Sometimes we know that there are multiple different things present in a single sample. Most common situation would be a sample with a chromosome as well as a plasmid. Here we will examine the same sample used in the novel DNA identification tutorial to see how inclusion of the 2nd reference file changes the mapping results.

The discussion of concepts in this tutorial are identical to the advanced mapping with bowtie2 tutorial and work with the same data. The commands in this tutorial will take ~15 minutes. Discussion of results is more thorough in the advanced mapping tutorial. As that tutorial directly assess the difference between multiple mapping types.

Prerequisite required

In order for this tutorial to work, you must have installed your own version of bowtie2 in this tutorial. If you have not the commands below will not run.

To verify these commands will work, use the 'which bowtie2' command and verify that bowtie2 resides in your $WORK directory not /opt

Learning objectives

Understand how mapping against multiple reference genomes simultaneously decreases noise and increases confidence in variant calls.
Call variants in multiple references simultaneously in breseq.

Mapping against multiple references

Why is mapping against multiple references at the same time preferred to mapping against multiple different references 1 at a time? The answer relates to identifying real mutations from errors. As we discussed in our initial mapping tutorial/presentation, mapping scores and alignment scores are both related to how confident the mapping program is that an individual read is mapped to the correct location in the genome, and how well that that read aligns at that location.Imagine a hypothetical situation where in you have a 200bp region of a low copy plasmid that differs from the chromosome by a single base.

If you map against both references at the same time, the mapper will associate each read uniquely to either the plasmid region or the chromosome without having any mismatches.
If you map against the references separately, both will result in 50% of the reads aligned to this region as having a high quality mapping score, and a slightly diminished alignment score for both runs. In the case of clonal haploids, the 50% frequency would be concerning as 50% shouldn't occur under normal circumstances, but the duplication of a region followed by a single base change in one of the copies would produce an identical result.

Get some data

Here we will use the same data as was used in the novel DNA identification tutorial plus an additional reference file associated with the plasmid known to be present.

mkdir $SCRATCH/GVA_advanced_mapping_breseq
cp $BI/gva_course/novel_DNA/* $SCRATCH/GVA_advanced_mapping_breseq
cp $BI/gva_course/advanced_mapping/* $SCRATCH/GVA_advanced_mapping_breseq
cd $SCRATCH/GVA_advanced_mapping_breseq
ls

^ expected 2 fastq files and 2 gbk reference files

Set up the run

Minimal information supplied about breseq here, please see advanced breseq tutorial for more information.

Remember to make sure you are on an idev done

For reasons discussed numerous times throughout the course already, please be sure you are on an idev done. It is unlikely that you are currently on an idev node as copying the files while on an idev node seems to be having problems as discussed. Remember the hostname command and showq -u can be used to check if you are on one of the login nodes or one of the compute nodes. If you need more information or help re-launching a new idev node, please see this tutorial.

As a challenge consider changing this command using the information from the advanced breseq tutorial to redirect the output to a log file and put the output in a subfolder that would allow you to make use of the helper script export-breseq and for loop for pulling .gd files to generate a compare table.

module unload bowtie/2.3.4
breseq -j 48 -r CP009273.1_Eco_BW25113.gbk -r GFP_Plasmid_SKO4.gbk SRR4341249_1.fastq SRR4341249_2.fastq

Evaluating the results

Recall the output directory must be transferred back to your local computer with scp to view the html output.

What percent of reads mapped?

Check the summary.html file, ask for help if you don't understand where to look to find the info on that page.

Next steps

Consider running the same breseq command without the 2nd reference. Does it effect the mutations that are called? Does it effect the percent of reads mapped as was observed in the advanced mapping tutorial?

Return to GVA2020