1 Overview
2 Learning objectives:
3 Input data / expectations:
4 breseq access
- 4.1 Check that you have access to breseq
5 Bacteriophage lambda data set
6 Bacteriophage lambda data set repeated
- 6.1 Data, and running breseq
  - 6.1.1 Commands to copy the input data from the first breseq run to a new folder, and rerun breseq on the same fastq and reference file in polymorphism mode. Since this copy command is between 2 scratch locations I doubt there will be issues with it, but remember to restart an idev node if you experience difficulties
- 6.2 Evaluating output
  - 6.2.1 Suggested compression command to prepare a single compressed directory for transfer. This is similar to what we used for the IGV tutorial
  - 6.2.2 Command to type in the desktop's terminal window to decompress the transferred archive after running the scp command
7 E. coli data from Mapping, SNV tutorials:
- - 7.1.1 partial scp command to copy to the current directory of your local computer
- 7.2 data
- 7.3 Running breseq
  - 7.3.1 breseq command
- 7.4 evaluating output
8 Additional tutorials dealing with breseq
9 Additional information on analyzing the output

Overview

breseq is a tool developed by the Barrick lab for analyzing genome re-sequencing data from bacteria. It is primarily used to analyze laboratory evolution experiments with microbes. In these experiments, there is usually a high-quality reference genome for the ancestral strain, and one is interested in exhaustively finding all of the mutations that occurred during the evolution experiment. One might then want to construct a phylogenetic tree of individual samples from a single population, or determine whether the same gene is mutated in many independent evolution experiments in an environment.

Learning objectives:

A quick introduction to a self-contained, automated pipeline for identifying mutations.
Explain the types of mutations found in a complete manner before using methods better suited for higher-order organisms.
Examine the same data used in the Mapping and SNV tutorials as breseq output.

Input data / expectations:

Haploid reference genome
Relatively small (<20 Mb) reference genome
Average genomic coverage > 30-fold
Less than ~1,000 mutations expected
Detects SNVs and SVs from split-read alignment (does not use paired-end distance information)
Produces annotated HTML output

You can learn a great deal more about breseq by reading the Online Documentation.

Here is a rough outline of the workflow in breseq with proposed additions.

This does mean that breseq is not suited for diploids or other very large genomes. GATK is a much larger pipeline with many more options. A tutorial for it can be found here, but students working with human data have still given positive feedback about the remainder of this tutorial.

breseq access

In order to run breseq, we need to make sure breseq was made available to you when we set up your .bashrc file on the first day.

Check that you have access to breseq

tacc:~$ which breseq
# expected output: /corral-repl/utexas/BioITeam/breseq/bin/breseq
tacc:~$ breseq --version
# expected output: breseq 0.35.1
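
If you would rather have a single check that also tells you what to do when breseq is missing, here is a minimal sketch. The helper name check_tool is made up for this example and is not part of breseq; it only wraps the shell's built-in command -v lookup.

```shell
# check_tool NAME
# Hypothetical helper: report whether NAME is on the PATH and where it was found.
check_tool() {
    local tool="$1"
    local location
    if location="$(command -v "$tool")"; then
        echo "$tool found at: $location"
        return 0
    fi
    echo "$tool not found on PATH; check your .bashrc setup" >&2
    return 1
}

# Example use:
#   check_tool breseq
```

If this reports that breseq is not found, the .bashrc setup from the first day is the first thing to revisit.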

breseq should now run using the breseq command. Running breseq without any options will show you which arguments the command expects.

A consolidated explanation of help files and my experience with them

Not all programs are configured to tell you what they expect when you simply type their name. Some require the name of the command followed by one of the following: -h or --help or ?, while others require preceding the command name with "man" (short for manual). If all that fails, Google is your friend for all programs not named "R" ... Google is still your best bet, but it won't be your friend. In the specific case of R, adding the word stats somewhere in the search will greatly help things in my experience.

Bacteriophage lambda data set

First, we'll run breseq on a small data set to be sure that it is installed correctly, and to get a taste of what the output looks like. This sample is a mixed population of bacteriophage lambda that was co-evolved in the lab with its E. coli hosts.

Data

The data files for this tutorial are located in the following location:

$BI/ngs_course/lambda_mixed_pop/data/

Copy the contents of this directory to a new directory called GVA_breseq_lambda_mixed_pop in your scratch directory.

mkdir $SCRATCH/GVA_breseq_lambda_mixed_pop
cp $BI/ngs_course/lambda_mixed_pop/data/* $SCRATCH/GVA_breseq_lambda_mixed_pop

Possible errors on idev nodes

As mentioned over Zoom, this is one instance where I know for sure that copying these files while on an idev node may not work and may give Input/Output errors. If you are already in an idev session and this does not work, just use the logout command to exit the idev session and retry the copy command. If both methods fail, please get my attention so we can figure out what is going on.

By the same token, if you actually are on an idev node and are able to transfer the files, please let me know, as it may help figure out what the real source of the error is.

ls $SCRATCH/GVA_breseq_lambda_mixed_pop

File Name	Description	Sample
`lambda_mixed_population.fastq`	Single-end Illumina 36-bp reads	Evolved lambda bacteriophage mixed population genome sequencing
`lambda.gbk`	Reference Genome	Bacteriophage lambda
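
Because the copy step can fail with Input/Output errors, it can help to confirm that both files in the table actually arrived and are not empty. The following is a minimal sketch; verify_inputs is a made-up helper name, and the only thing it checks is that each file exists with a size greater than zero.

```shell
# verify_inputs DIR
# Hypothetical helper: succeeds only if both tutorial input files
# exist in DIR and are non-empty.
verify_inputs() {
    local dir="$1"
    local rc=0
    local f
    for f in lambda_mixed_population.fastq lambda.gbk; do
        if [ -s "$dir/$f" ]; then
            echo "ok:      $f"
        else
            echo "missing: $f" >&2
            rc=1
        fi
    done
    return "$rc"
}

# Example use:
#   verify_inputs "$SCRATCH/GVA_breseq_lambda_mixed_pop"
```

A file reported as missing here usually means the cp command should be retried as described above.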

Running breseq

Because this data set is relatively small (roughly 100x coverage of a 48,000 bp genome), a breseq run will take < 5 minutes. It is still computationally intense enough that it should not be run on the head node, since we have reservations and there's no reason not to use them.
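
If you want to see where a coverage figure like this comes from, coverage is simply the total number of sequenced bases divided by the genome size. Here is a rough sketch of that arithmetic; estimate_coverage is a made-up helper name, and it assumes an uncompressed fastq file with exactly 4 lines per read.

```shell
# estimate_coverage FASTQ GENOME_SIZE_BP
# Hypothetical helper: coverage = total sequenced bases / genome size.
# Assumes an uncompressed FASTQ with exactly 4 lines per read.
estimate_coverage() {
    local fastq="$1"
    local genome_size="$2"
    awk -v g="$genome_size" '
        NR % 4 == 2 { bases += length($0); reads++ }   # line 2 of each record is the sequence
        END { printf "reads=%d bases=%d coverage=%.1fx\n", reads, bases, bases / g }
    ' "$fastq"
}

# Example use, with the genome size quoted in this tutorial:
#   estimate_coverage lambda_mixed_population.fastq 48000
```

This counts every read in the file, so it is an upper bound on the coverage breseq will actually report, since not every read will map to the reference.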

Remember to make sure you are on an idev node

For reasons discussed numerous times throughout the course already, please be sure you are on an idev node. Remember that the hostname command and showq -u can be used to check whether you are on one of the login nodes or one of the compute nodes. If you need more information or help launching a new idev node, please see this tutorial.
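
The hostname check can be turned into a quick guard, sketched below. Both function names are made up for this example, and the test itself is an assumption: it treats any hostname containing the word "login" as a login node, so confirm that this matches what hostname prints on your system before relying on it.

```shell
# is_login_node HOSTNAME
# Assumption: login nodes have "login" somewhere in their hostname
# and compute nodes do not. Verify this on your own system.
is_login_node() {
    case "$1" in
        *login*) return 0 ;;
        *)       return 1 ;;
    esac
}

# warn_if_login_node
# Hypothetical guard to run before starting anything heavy.
warn_if_login_node() {
    if is_login_node "$(hostname)"; then
        echo "You appear to be on a login node; start an idev session first." >&2
        return 1
    fi
    echo "Not a login node: $(hostname)"
}

# Example use:
#   warn_if_login_node && breseq -r lambda.gbk lambda_mixed_population.fastq
```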

breseq command

cd $SCRATCH/GVA_breseq_lambda_mixed_pop
breseq -r lambda.gbk lambda_mixed_population.fastq

While breseq is running, let's look at what the different parts of the command are actually doing:

part	purpose
-r lambda.gbk	Use the lambda.gbk file as the reference to identify specific mutations
lambda_mixed_population.fastq	breseq assumes any argument not preceded by a - option to be an input fastq file to be used for mapping

This is the absolute minimal command that breseq can do anything with: a reference file and a fastq file. When you executed the command without any options you saw more options, and if you use breseq --help you will see more still. This run will finish very quickly (less than 1 minute) with a final line of "+++ SUCCESSFULLY COMPLETED". If you instead see something different as the last line before getting your prompt back, get my attention.
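
If you would like to check for that final line without scrolling back through the terminal, one option is to save what breseq prints to a file and test its last line. This is only a sketch: run_succeeded is a made-up helper name and breseq.log is an arbitrary file name chosen for the example, not something breseq creates on its own.

```shell
# run_succeeded LOGFILE
# Hypothetical helper: succeeds if the last non-empty line of LOGFILE
# contains the success message quoted in this tutorial.
run_succeeded() {
    grep -v '^[[:space:]]*$' "$1" | tail -n 1 | grep -q 'SUCCESSFULLY COMPLETED'
}

# Example use (breseq.log is an arbitrary name; 2>&1 captures both output streams):
#   breseq -r lambda.gbk lambda_mixed_population.fastq 2>&1 | tee breseq.log
#   run_succeeded breseq.log && echo "breseq finished cleanly"
```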

Evaluating output

breseq produces a number of directories whose names begin 01_sequence_conversion, 02_reference_alignment, ... Each of these contains intermediate files that are used to 'pick up where it left off' if the run doesn't complete successfully. These can be deleted when the run completes, or explored if you are interested in the inner workings of what is going on. More importantly, breseq also produces two directories called data and output, which contain the files used to create the .html output files and the .html output files themselves, respectively. The most interesting files are the .html files, which can't be viewed directly on Lonestar. Therefore we first need to copy the output directory back to your desktop computer. Use scp to transfer the contents of the output directory back to your local computer.

Remember this tutorial is available if you need help transferring files.
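
One way to avoid typos in the long remote path is to have the remote machine print the scp command for you, and then paste it into a terminal on your own computer. The sketch below only prints the command and does not run it. The helper name print_scp_command is made up, and the default host name ls5.tacc.utexas.edu is an assumption; use whatever host name you normally ssh to.

```shell
# print_scp_command USER REMOTE_DIR [HOST]
# Hypothetical helper: prints (does not run) an scp command that copies
# REMOTE_DIR/output into the current directory of your local computer.
# Assumption: HOST defaults to ls5.tacc.utexas.edu; replace it with the
# host name you normally ssh to.
print_scp_command() {
    local user="$1"
    local remote_dir="$2"
    local host="${3:-ls5.tacc.utexas.edu}"
    echo "scp -r ${user}@${host}:${remote_dir}/output ."
}

# Example use on the remote machine, then paste the result into your local terminal:
#   print_scp_command "$USER" "$SCRATCH/GVA_breseq_lambda_mixed_pop"
```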
