Identifying mutations in microbial genomes (breseq)

Introduction

breseq is a tool developed by the Barrick lab for analyzing genome re-sequencing data from bacteria. It is primarily used to analyze laboratory evolution experiments with microbes. In these experiments, there is usually a high-quality reference genome for the ancestral strain, and one is interested in exhaustively finding all of the mutations that occurred during the evolution experiment. One might then want to construct a phylogenetic tree of individual samples from a single population, or determine whether the same gene is mutated in many independent evolution experiments in an environment.

Input data / expectations:

  • Haploid reference genome
  • Relatively small (<20 Mb) reference genome
  • Input FASTQ reads can be from any sequencing technology
  • Average genomic coverage > 20-fold
  • Less than ~1,000 mutations expected
  • Detects SNVs and structural variants from single-end reads
  • Produces annotated HTML output

You can learn a great deal more about breseq by reading the Online Documentation.

Here is a rough outline of the workflow in breseq with proposed additions.

Install breseq

Download breseq from Google code

See if you can install breseq and get it running from the installation instructions.

You will need Bowtie version 2.0.0-beta7 or later to run breseq. The version currently available on TACC through module load is older than this.

 I need help...

Hint: The previous lesson on Installing Linux tools should help you get bowtie2 and breseq installed. A suitable version of R is already installed on TACC. Remember that you can load that using the command:

module load R
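Once everything is installed, a quick check that the programs can be found on your PATH can save a failed job. The following is a minimal sketch, not part of breseq itself; check_tools is a hypothetical helper name, and it only confirms that each program is found, not that its version is new enough.

```shell
# Report whether each named program can be found on the PATH.
# Returns non-zero if any program is missing.
check_tools() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found:   $tool -> $(command -v "$tool")"
        else
            echo "missing: $tool"
            missing=1
        fi
    done
    return "$missing"
}
```

For this lesson you might call it as check_tools breseq bowtie2 R, and then run bowtie2 --version to confirm that the reported version is at least 2.0.0-beta7.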

Example 1: Bacteriophage lambda data set

First, we'll run breseq on a small data set to be sure that it is installed correctly, and to get a taste for what the output looks like. This sample is a mixed population of bacteriophage lambda that was co-evolved in lab with its E. coli hosts.

Data

The data files for this example are in the path:

$BI/ngs_course/lambda_mixed_pop/data

Copy this directory to your $SCRATCH space, giving the copy a name other than data, and then cd into it.
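One way to do the copy is sketched below. The helper name copy_dataset and the destination name lambda_breseq used in the usage line are hypothetical; any name other than data will do. The function refuses to overwrite an existing destination so that a repeated run does not silently nest one copy inside another.

```shell
# Copy a data directory to a new name.
# Arguments: source directory, destination directory (must not exist yet).
copy_dataset() {
    src=$1
    dest=$2
    if [ ! -d "$src" ]; then
        echo "source directory not found: $src" >&2
        return 1
    fi
    if [ -e "$dest" ]; then
        echo "destination already exists: $dest" >&2
        return 1
    fi
    cp -r "$src" "$dest"
}
```

On TACC this could be used as copy_dataset "$BI/ngs_course/lambda_mixed_pop/data" "$SCRATCH/lambda_breseq" && cd "$SCRATCH/lambda_breseq".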

File Name                      Description                      Sample
lambda_mixed_population.fastq  Single-end Illumina 36-bp reads  Evolved lambda bacteriophage mixed population genome sequencing
lambda.gbk                     Reference Genome                 Bacteriophage lambda
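Since breseq expects more than 20-fold average coverage, it can be worth estimating coverage from the FASTQ file before running. The sketch below divides the total number of sequenced bases by the genome length. It assumes a standard four-line-per-record FASTQ file, and estimate_coverage is a hypothetical helper name; the genome length is something you supply yourself.

```shell
# Estimate average coverage as (total bases in reads) / (genome length).
# Assumes a 4-line-per-record FASTQ file: the sequence is line 2 of each record.
estimate_coverage() {
    fastq=$1
    genome_length=$2
    awk -v g="$genome_length" '
        NR % 4 == 2 { bases += length($0); reads++ }
        END { printf "reads=%d bases=%d coverage=%.1f\n", reads, bases, bases / g }
    ' "$fastq"
}
```

For this data set you could run estimate_coverage lambda_mixed_population.fastq 48000, using an approximate length for the lambda genome.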

Running breseq

Because this data set is relatively small (roughly 100x coverage of a 48,000 bp genome), a breseq run will take < 5 minutes. Submit this command to the TACC development queue.

breseq -r lambda.gbk lambda_mixed_population.fastq > log.txt

A bunch of progress messages will stream by during the breseq run. They detail several steps in a pipeline that combines mapping (using Bowtie 2), variant calling, annotating mutations, etc. You can examine them by peeking in the log.txt file as your job runs using tail -f. The -f option means to "follow" the file and keep giving you output from it as it gets bigger. You will need to wait for your job to start running before you can tail -f log.txt.
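Because log.txt does not exist until the job starts, tail -f will fail if you run it too early. A small polling loop is one way around this. This is only a sketch: wait_for_file is a hypothetical helper, and the default 600-second timeout is an arbitrary choice, not anything specific to TACC.

```shell
# Wait until a file exists, polling once per second, up to a timeout in seconds.
# Returns 0 as soon as the file appears, 1 if the timeout is reached first.
wait_for_file() {
    path=$1
    timeout=${2:-600}
    waited=0
    while [ ! -e "$path" ]; do
        if [ "$waited" -ge "$timeout" ]; then
            echo "gave up waiting for $path after ${timeout}s" >&2
            return 1
        fi
        sleep 1
        waited=$((waited + 1))
    done
    return 0
}
```

You could then run wait_for_file log.txt && tail -f log.txt, and press Ctrl-C to stop following the file.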

Looking at breseq predictions

breseq will produce a lot of directories beginning 01_sequence_conversion, 02_reference_alignment, ... Each of these contains intermediate files that can be deleted when the run completes, or explored if you are interested in the inner guts of what is going on.

breseq will also produce two directories called: data and output.
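If you want to tidy up after a run, the numbered intermediate directories can be listed and then removed while data and output are left alone. The sketch below assumes that the intermediate directories are exactly those whose names start with two digits and an underscore, as in the examples above; clean_intermediate is a hypothetical helper name. It only lists by default, so you can check what would be removed first.

```shell
# List (or, with "delete" as the second argument, remove) the numbered
# intermediate directories inside a breseq run directory.
# Assumption: intermediate directories are exactly those named NN_something.
clean_intermediate() {
    run_dir=$1
    mode=${2:-list}
    for d in "$run_dir"/[0-9][0-9]_*/; do
        [ -d "$d" ] || continue   # the pattern matched nothing
        if [ "$mode" = "delete" ]; then
            rm -r "${d%/}"
        else
            echo "${d%/}"
        fi
    done
}
```

Run clean_intermediate . to see the candidates, and clean_intermediate . delete once you are sure you no longer need them.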

First, copy the output directory back to your desktop computer.

 Need some help?

If you use scp then you will need to run it in a terminal that is on your desktop and not on the remote TACC system. It can be tricky to figure out where the files are on the remote TACC system, because your desktop won't understand what $HOME, $WORK, $SCRATCH mean (they are only defined on TACC).

To figure out the full path to your file, you can use the pwd command in your terminal on TACC:

login1$ pwd

Then try a command like this on your desktop:

desktop1$ scp -r username@lonestar.tacc.utexas.edu:/the/directory/returned/by/pwd/output .

It would be even better practice to archive and gzip the output directory with tar -cvzf before copying it, then copy that single file and unarchive it on your desktop with tar -xvzf.
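The archive-then-copy approach can be sketched as two small functions, one to run on TACC and one on your desktop. The names pack_dir and unpack_archive are hypothetical, as is the archive name output.tar.gz in the usage note. The -v (verbose) flag is left out here to keep the output quiet.

```shell
# Pack a directory into a gzip-compressed tar archive.
# Arguments: directory to archive, archive file name to create.
pack_dir() {
    dir=$1
    archive=$2
    # -C switches to the parent directory so the archive stores a relative path.
    tar -czf "$archive" -C "$(dirname "$dir")" "$(basename "$dir")"
}

# Unpack an archive into a target directory (created if needed).
unpack_archive() {
    archive=$1
    target=$2
    mkdir -p "$target"
    tar -xzf "$archive" -C "$target"
}
```

On TACC you would run pack_dir output output.tar.gz, copy output.tar.gz to your desktop with scp (no -r needed for a single file), and then run unpack_archive output.tar.gz . there.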

Inside of the output directory is a file called index.html. Open this in a web browser on your desktop and click around to take a look at the mutation predictions and summary information.

Optional Exercise: Running breseq in mixed population mode

The data set you are examining is actually of a mixed population of many different phage lambda genotypes descended from a clonal ancestor. You have run breseq in a mode where it is predicting consensus mutations in what it thinks is one uniform haploid genome. Actually, some individuals in the population have certain mutations and others do not, so you might have noticed when you looked at some of the alignments that there was a mixture of bases at a position.

As an optional exercise, you can use a somewhat experimental feature of breseq to run in a mode where it estimates the frequencies of different mutations in the population. This process is most accurate for single nucleotide variants. Mutations at intermediate frequencies are not (yet) predicted for classes of mutations like large structural variants.