Lonestar5 Breseq Tutorial GVA2020
- 1 Overview
- 2 Learning objectives:
- 3 Input data / expectations:
- 4 breseq access
- 5 Bacteriophage lambda data set
- 5.1 Data
- 5.2 Running breseq
- 5.2.1 breseq command
- 5.3 Evaluating output
- 6 Bacteriophage lambda data set repeated
- 7 E. coli data from Mapping, SNV tutorials:
- 8 Additional tutorials dealing with breseq
- 9 Additional information on analyzing the output
Overview
breseq is a tool developed by the Barrick lab for analyzing genome re-sequencing data for bacteria. It is primarily used to analyze laboratory evolution experiments with microbes. In these experiments, there is usually a high-quality reference genome for the ancestral strain, and one is interested in exhaustively finding all of the mutations that occurred during the evolution experiment. One might then want to construct a phylogenetic tree of individual samples from a single population, or determine whether the same gene is mutated in many independent evolution experiments in an environment.
Learning objectives:
Quick introduction to a self-contained, automated pipeline to identify mutations.
Explain the full range of mutation types found before using methods better suited for higher-order organisms.
Examine the same data used in the Mapping and SNV tutorials as breseq output.
Input data / expectations:
Haploid reference genome
Relatively small (<20 Mb) reference genome
Average genomic coverage > 30-fold
Less than ~1,000 mutations expected
Detects SNVs and SVs from split read alignment of reads (does not use paired-end distance information)
Produces annotated HTML output
You can learn a great deal more about breseq by reading the Online Documentation.
Here is a rough outline of the workflow in breseq with proposed additions.
These expectations mean that breseq is not suited for diploids or other very large genomes. GATK is a much larger pipeline with many more options, and a tutorial for it can be found here. Even so, students working with human data have still given positive feedback about the remainder of this tutorial.
breseq access
In order to run breseq, we need to make sure breseq was made available to you when we set up your .bashrc file on the first day.
Check that you have access to breseq
tacc:~$ which breseq
# expected output: /corral-repl/utexas/BioITeam/breseq/bin/breseq
tacc:~$ breseq --version
# expected output: breseq 0.35.1
breseq should now run using the breseq command. Running breseq without any options will show you what the command expects.
A consolidated explanation of help files and my experience with them
Not all programs are configured to tell you what they expect when you simply type their name. Some require the name of the command followed by one of the following: -h or --help or ?, while others require preceding the command name with "man" (short for manual). If all that fails, Google is your friend for all programs not named "R". For R, Google is still your best bet, but it won't be your friend; in my experience, adding the word stats somewhere in the search helps greatly.
Bacteriophage lambda data set
First, we'll run breseq on a small data set to be sure that it is installed correctly and to get a taste of what the output looks like. This sample is a mixed population of bacteriophage lambda that was co-evolved in the lab with its E. coli hosts.
Data
The data files for this tutorial are located in the following location:
$BI/ngs_course/lambda_mixed_pop/data/
Copy the contents of this directory to a new directory called GVA_breseq_lambda_mixed_pop in your scratch directory.
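If you get stuck, the copy can be sketched roughly as below. This is one possible approach rather than the only one: the helper name copy_tutorial_data is made up for illustration, and the usage lines assume that $BI and $SCRATCH are defined in your TACC environment.

```shell
# Hypothetical helper: copy every file from a source directory into a
# destination directory, creating the destination first if it is missing.
copy_tutorial_data() {
    src_dir=$1
    dest_dir=$2
    mkdir -p "$dest_dir" && cp "$src_dir"/* "$dest_dir"/
}

# Usage on TACC (assumes $BI and $SCRATCH are set in your environment):
# copy_tutorial_data "$BI/ngs_course/lambda_mixed_pop/data" "$SCRATCH/GVA_breseq_lambda_mixed_pop"
# ls "$SCRATCH/GVA_breseq_lambda_mixed_pop"
```

Listing the new directory afterwards is a quick way to confirm that both the reads and the reference file arrived.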
Possible errors on idev nodes
As mentioned over Zoom, this is one instance where I know for sure that copying these files while on an idev node may fail with Input/Output errors. If you are already in an idev session and the copy does not work, use the logout command to exit the idev session and retry the copy command. If both methods fail, please get my attention so we can figure out what is going on.
By the same token, if you are on an idev node and are able to transfer the files, please let me know, as it may help figure out the real source of the error.
| File Name | Description | Sample |
|---|---|---|
| lambda_mixed_population.fastq | Single-end Illumina 36-bp reads | Evolved lambda bacteriophage mixed population genome sequencing |
| lambda.gbk | Reference Genome | Bacteriophage lambda |
Running breseq
Because this data set is relatively small (roughly 100x coverage of a 48,000 bp genome), a breseq run will take < 5 minutes. It is still computationally intense enough that it should not be run on the head node, and since we have reservations there's no reason not to use them.
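The coverage figure can be checked with a little arithmetic: average coverage is the total number of sequenced bases divided by the genome length. With 36-bp reads, 100x coverage of 48,000 bp would correspond to roughly 133,000 reads. The sketch below does the same calculation from a FASTQ file; the helper name estimate_coverage is made up, and it assumes standard 4-line FASTQ records (sequence on the second line of each record).

```shell
# Hypothetical helper: estimate average coverage as
#   (total bases in the FASTQ file) / (genome length in bp).
# ASSUMPTION: each read occupies exactly 4 lines, sequence on line 2.
estimate_coverage() {
    fastq=$1
    genome_bp=$2
    awk -v g="$genome_bp" 'NR % 4 == 2 { bases += length($0) }
        END { printf "%.1f\n", bases / g }' "$fastq"
}

# Usage (48000 is the approximate lambda genome length quoted above):
# estimate_coverage lambda_mixed_population.fastq 48000
```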
Remember to make sure you are on an idev node
For reasons discussed numerous times throughout the course already, please be sure you are on an idev node. Remember that the hostname command and showq -u can be used to check whether you are on one of the login nodes or one of the compute nodes. If you need more information or help launching a new idev session, please see this tutorial.
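As a rough illustration of the hostname check, the sketch below labels the current machine. It rests on an assumption that is not stated in this tutorial, namely that login node hostnames begin with "login"; confirm the naming on your system before relying on it. The helper name node_type is made up.

```shell
# Hypothetical helper: classify a hostname as a login or compute node.
# ASSUMPTION: login node hostnames start with "login"; anything else is
# treated as a compute node. Check this against your own system.
node_type() {
    case "$1" in
        login*) echo "login" ;;
        *)      echo "compute" ;;
    esac
}

echo "This shell appears to be on a $(node_type "$(hostname)") node"
# On TACC you can also list your queued and running jobs with:
# showq -u
```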
breseq command
cd $SCRATCH/GVA_breseq_lambda_mixed_pop
breseq -r lambda.gbk lambda_mixed_population.fastq
While breseq is running, let's look at what the different parts of the command are actually doing:
| part | purpose |
|---|---|
| -r lambda.gbk | Use the lambda.gbk file as the reference to identify specific mutations |
| lambda_mixed_population.fastq | breseq assumes any argument not preceded by a - option to be an input fastq file to be used for mapping |
This is the absolute minimal command that breseq can do anything with: a reference file and a fastq file. When you executed the command without any options you saw more options, and if you use breseq --help you will see more still. The run will finish very quickly (less than 1 minute) with a final line of "+++ SUCCESSFULLY COMPLETED". If you instead see something different as the last line before getting your prompt back, get my attention.
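If you would rather not watch the screen, one option is to save the run's messages to a file and check the last line afterwards. The sketch below is only an illustration: the helper name breseq_succeeded and the log file name breseq.log are arbitrary choices, and the check simply looks for the success message quoted above.

```shell
# Hypothetical helper: succeed only if the last non-empty line of a saved
# log file contains the success message.
breseq_succeeded() {
    grep -v '^[[:space:]]*$' "$1" | tail -n 1 | grep -q 'SUCCESSFULLY COMPLETED'
}

# Usage (2>&1 captures both output streams; tee shows and saves them):
# breseq -r lambda.gbk lambda_mixed_population.fastq 2>&1 | tee breseq.log
# if breseq_succeeded breseq.log; then echo "run finished"; else echo "check breseq.log"; fi
```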
Evaluating output
breseq produced a lot of directories beginning 01_sequence_conversion, 02_reference_alignment, ... Each of these contains intermediate files that are used to 'pick up where it left off' if the run doesn't complete successfully. These can be deleted when the run completes, or explored if you are interested in the inner guts of what is going on.
More importantly, breseq also produces two directories called data and output. The data directory contains the files used to create the .html output files, and the output directory contains the .html files themselves. The most interesting files are the .html files, which can't be viewed directly on Lonestar, so we first need to copy the output directory back to your desktop computer. Use scp to transfer the contents of the output directory back to your local computer.
Remember this tutorial is available if you need help transferring files.
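As a reminder of the general shape of the transfer, the sketch below prints an scp command that you would then run in a terminal on your local computer, not on TACC. The helper name print_scp_command is made up, the username and paths are placeholders, and the host name ls5.tacc.utexas.edu is an assumption to verify against what you normally use to log in.

```shell
# Hypothetical helper: print the scp command to run on your LOCAL computer.
# Arguments: TACC username, remote breseq run directory, local destination.
# ASSUMPTION: ls5.tacc.utexas.edu is the host you log in to.
print_scp_command() {
    printf 'scp -r %s@ls5.tacc.utexas.edu:%s/output %s\n' "$1" "$2" "$3"
}

# Usage: run `echo $SCRATCH` on TACC first to learn your full scratch path,
# then substitute your own username and that path for the placeholders:
# print_scp_command my_username /full/scratch/path/GVA_breseq_lambda_mixed_pop .
```

The -r option is needed because output is a directory rather than a single file.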