Lonestar5 Breseq Tutorial GVA2020

Overview

breseq is a tool developed by the Barrick lab intended for analyzing genome re-sequencing data for bacteria. It is primarily used to analyze laboratory evolution experiments with microbes. In these experiments, there is usually a high-quality reference genome for the ancestral strain, and one is interested in exhaustively finding all of the mutations that occurred during the evolution experiment. Then one might want to construct a phylogenetic tree of individuals samples from a single population or determine whether the same gene is mutated in many independent evolution experiments in an environment.

Learning objectives:

  • Quick introduction to a self contained/automated pipeline to identify mutations.
  • Explain the types of mutations found in a complete manner before using methods better suited for higher order organisms.
  • Examine the same data used in the Mapping, and SNV tutorials as breseq output.


Input data / expectations:

  • Haploid reference genome
  • Relatively small (<20 Mb) reference genome
  • Average genomic coverage > 30-fold
  • Less than ~1,000 mutations expected
  • Detects SNVs and SVs from split read alignment of reads (does not use paired-end distance information)
  • Produces annotated HTML output

You can learn a great deal more about breseq by reading the Online Documentation.

Here is a rough outline of the workflow in breseq with proposed additions.


This does mean that breseq is not suited for diploids, and other very large genomes. GATK is a much larger pipeline that has many more additional options. While a tutorial for it can be found here, and students working with human data have still had positive feedback about the remainder of this tutorial.

breseq access

In order to run breseq, we need to make sure breseq was made available to you when we set up your .bashrc file on the first day.

Check that you have access to breseq
tacc:~$ which breseq
# expected output: /corral-repl/utexas/BioITeam/breseq/bin/breseq
tacc:~$ breseq --version
# expected output: breseq 0.35.1

breseq should now run using the breseq command. breseq without any options will show you what the command expectations are.

A consolidated explanation of help files and my experience with them

Not all programs are configured to tell you what it expects just from typing the name of it. Some require the name of the command followed by one of the following: -h or --help or ? while others require preceding the command name with "man" (short for manual). If all that fails google is your friend for all programs not named "R" ... google is still your best bet, but it won't be your friend. In the specific case of R, adding the word stats somewhere to the search will greatly help things in my experience.

Bacteriophage lambda data set

First, we'll run breseq on a small data set to be sure that it is installed correctly, and to get a taste for what the output looks like. This sample is a mixed population of bacteriophage lambda that was co-evolved in lab with its E. coli hosts.

Data

The data files for this tutorial is located in following location:

$BI/ngs_course/lambda_mixed_pop/data/

Copy the contents of this directory to a new directory called GVA_breseq_lambda_mixed_pop in your scratch directory.

By this point in the course you should not need to expand this box to see the suggested solution. You should continue expanding boxes such as this to make sure you are not drifting too far
mkdir $SCRATCH/GVA_breseq_lambda_mixed_pop
cp $BI/ngs_course/lambda_mixed_pop/data/* $SCRATCH/GVA_breseq_lambda_mixed_pop

Possible errors on idev nodes

As mentioned over zoom this is one instance that i know for sure copying these files while on an idev node may not work giving Input/Output errors. If you are already on an idev session and this does not work, just use the logout command to exit the idev session and retry the copy command. If both methods fail, please get my attention so we can figure out what is going on.

By a similar token if you actually are on an idev node and are able to transfer the files, please let me know as it may help figure out what the real source of the error is.

Now use the ls command to see what files were copied. again, you should not need to expand this to get the output listed below
ls $SCRATCH/GVA_breseq_lambda_mixed_pop

File Name

Description

Sample

lambda_mixed_population.fastq

Single-end Illumina 36-bp reads

Evolved lambda bacteriophage mixed population genome sequencing

lambda.gbk

Reference Genome

Bacteriophage lambda

Running breseq

Because this data set is relatively small (roughly 100x coverage of a 48,000 bp genome), a breseq run will take < 5 minutes, but it is computationally intense enough that it should not be run on the head node since we have reservations and theres no reason not to use them. 

Remember to make sure you are on an idev done

For reasons discussed numerous times throughout the course already, please be sure you are on an idev done. Remember the hostname command and showq -u can be used to check if you are on one of the login nodes or one of the compute nodes. If you need more information or help re-launching a new idev node, please see this tutorial.

breseq command
cd $SCRATCH/GVA_breseq_lambda_mixed_pop
breseq -r lambda.gbk lambda_mixed_population.fastq

While breseq is running lets look at what the different parts of the command are actually doing:

partpuprose
-r lambda.gbkUse the lambda.gbk file as the reference to identify specific mutations
lambda_mixed_population.fastqbreseq assumes any argument not preceded by a - option to be an input fastq file to be used for mapping

This is the absolute minimal command that breseq can do anything with: a reference file and a fastq file. When you executed the command without any options you saw more options and if you use breseq --help you will see more still. This will finish very quickly (less than 1 minute) with a final line of "+++   SUCCESSFULLY COMPLETED". If you instead see something different as the last line before getting your prompt back, get my attention.

Evaluating output

breseq produced a lot of directories beginning 01_sequence_conversion02_reference_alignment, ... Each of these contains intermediate files that are used to 'pickup where it left off' if the run doesn't complete successfully. These can be deleted when the run completes, or explored if you are interested in the inner guts of what is going on. More importantly, breseq will also produce two directories called: data and output which contain files used to create .html output files and .html output files respectively. The most interesting files are the .html files which can't be viewed directly on lonestar. Therefore we first need to copy the output directory back to your desktop computer. Use scp to transfer the contents of the output directory back to your local computer.

Remember this tutorial is available if you need help transferring files. 

suggested compression command to prepare a single compressed. directory for transfer. This is similar to what we used for the IGV tutorial
cd $SCRATCH/GVA_breseq_lambda_mixed_pop
tar -czvf output.tar.gz output  # the czvf options in order mean Create, Zip, Verbose, Force
pwd # helpful for the Remote (Right) terminal window
Command to type in the desktop's terminal window to decompress the transferred archive after running the scp command
# scp command first ...
tar -xvzf output.tar.gz  # the new "x" option at the front means eXtract 

Navigate to the output directory in the finder and open the a file called index.html. This will open the results in a web browser window that you can click through different mutations by using clicking the 2 letter codes or symbols on the left hand side of each row. The summary statistics link at the top of the page provides useful information about the percent of reads mapping (73.9%) to the genome as well as a link overall coverage of the genome where you can see a line graph showing how many reads mapped to each location across the entire genome (125.3 on average though varies between more than 200 and less than 50 depending on the location). The coverage summary graph is an awesome way to look for super large scale events in this case you can see the ~6kb deletion in the middle of the genome. The Mutation Predictions page is where most of the analysis time is spent in determining which mutations are important (or more rarely inaccurate).

Click around through the different mutations and examine their evidence to see what kinds of mutations you can identify. If you cant understand what type of mutation each line represents, or how the images should help you understand what the mutation is, please dont hesitate to interact over zoom.

Bacteriophage lambda data set repeated

Did you notice the name of the fastq file we used? lambda_mixed_population.fastq. As the name of the file implies, this file is actually from a mixed population of phage though we did not include any information about that fact in the breseq command we used. Further as you clicked around on some of the RA evidences you may have noticed that some of the mutations which listed reads as having aligned to the reference genome as well. This is similar to our initial SNV tutorial where the variant calls were made in a consensus diploid mode which forced the program to decide between variants being at ~50% and 100%, breseq only works on haploid organisms and thus assumed that any variants that were present must have been present at 100%.

Since we know it was a mixed population we can actually rerun the same fastq files against the same reference and add a flag for polymorphism mode (-p) and see what the difference in results is when we tell breseq that variants may exist at any frequency.

 Data, and running breseq

Commands to copy the input data from the first breseq run to a new folder, and rerun breseq on the same fastq and reference file in polymorphism mode. Since this copy command is between 2 scratch locations i doubt there will be issues with it, but remember to restart an idev node if you experience difficulties
mkdir $SCRATCH/GVA_breseq_lambda_mixed_pop_polymode
cp $SCRATCH/GVA_breseq_lambda_mixed_pop/lambda* $SCRATCH/GVA_breseq_lambda_mixed_pop_polymode
cd $SCRATCH/GVA_breseq_lambda_mixed_pop_polymode
breseq -p -r lambda.gbk lambda_mixed_population.fastq

Evaluating output

Again you will need to transfer files back to your local computer to visualize the differences. The same exact compression command will work as the folder name is the same. In doing so you need to be careful where you transfer that file to on your local computer such that you don't overwrite the previously transferred files. Maybe add a _polymode to the directory you are transferring to as we did in our command above. Again help with SCP can be found here.

suggested compression command to prepare a single compressed. directory for transfer. This is similar to what we used for the IGV tutorial
tar -czvf output.tar.gz output  # the czvf options in order mean Create, Zip, Verbose, Force
pwd # helpful for the Remote (Right) terminal window
Command to type in the desktop's terminal window to decompress the transferred archive after running the scp command
# scp command first ...
tar -xvzf output.tar.gz  # the new "x" option at the front means eXtract 

When you look at the summary statistic page, you will see none of the output has changed until you get quite far down the page and find that this time it was run in full polymorphism mode. When you look at the mutation predictions page, you now see more total mutations (with most of the new mutations being at frequencies less than 20%), a new column listing the frequency each mutation was listed at (with variants at less than 100% showing up in green), and if you look closely some mutations that were previously listed in white and at 100% are now listed as less than 100% (ie 82.0%). Hopefully from the discussions we've been having to this point it makes sense that mutations that are real but at low frequency would be mistaken for 0% rather than 100% when those are the only 2 choices. Again feel free to get my attention if you have any questions about the output, such as wondering why there are so many mutations at 100% even in a mixed population sample.

E. coli data from Mapping, SNV tutorials:

As a reminder, the read files we were working with in the bowtie2 and SNV tutorials were originally downloaded from the NCBI Sequence Read Archive via the corresponding European Nucleotide Archive record. They are Illumina Genome Analyzer sequencing of a paired-end library from a (haploid) E. coli clone that was isolated from a population of bacteria that had evolved for 20,000 generations in the laboratory as part of a long-term evolution experiment (Barrick et al, 2009). The reference genome is the ancestor of this E. coli population (strain REL606), so we expect the read sample to have differences from this reference that correspond to mutations that arose during the evolution experiment. If that description sounds like what breseq was made for ... breseq was literally developed at least in part to anlyze this data.

Running this data set not recommended

Running this data set yourself is not actually recommended, or at least not recommended until you have completed some of the advanced breseq tutorials for 3 reasons:

  1. At best, this data set expected to take ~1 hour to complete.
  2. Version conflicts with bowtie2 between TACC and breseq prevent accessing the 'best' time.
  3. Potential issue with the launcher module.

3 can be worked around fairly simply as can be seen in the multiqc tutorial 2 will be handled in breseq installation tutorial after which the data will still take ~1hr.

Therefore while the information of how to run this data is below, it is recommended that the majority of you focus on just downloading the results of that run when I did it.

partial scp command to copy to the current directory of your local computer
scp <user>@ls5.tacc.utexas.edu:/corral-repl/utexas/BioITeam/ngs_course/lambda_mixed_pop/breseq_SRR030257_run_output_folder.tar.gz .

Once you have downloaded the compressed output to your local computer you can jump down to evaluating it, or read through the commands I used for the run and some more explanation of what is going on.


data

Like we did yesterday we'll start by downloading our reads and reference into a new folder on scratch:

Remember that to copy an entire folder requires the use of the recursive (-r) option.
cds
cp -r $BI/gva_course/mapping/data GVA_breseq_comparison_to_bowtieSamtoolsTutorials
cd GVA_breseq_comparison_to_bowtieSamtoolsTutorials
ls

Running breseq

breseq command
breseq -j 48 -r NC_012967.1.gbk SRR030257_1.fastq SRR030257_2.fastq.gz

As mentioned early in the course, some programs can actually take compressed fastq files in as input and breseq is 1 such program. In the above example, it actually takes 2 fastq files in, 1 as a non-compressed file, the other as a gzipped file. The other thing we have introduced is the -j 48 option which should allow breseq to access 48 processors. Unfortunately if you have not completed the breseq installation tutorial, there is a version conflict between the module version of bowtie2 and breseq and will therefore throw an error when breseq tries to call bowtie2.

As noted above, the run will still take ~1 hour even after you have installed your own version of breseq and therefore it is recommended that you use the launcher_creator.py script to send this job to the queue after you have completed the installation tutorial.

evaluating output

Here are the comments from the IGV tutorial evaluating the data:

 Some interesting example coordinates
  •  Coordinate 161,041. What gene is this in and what is the effect on the protein sequence?

    Gene is pcnB, mutation is a snp

  •  Coordinate 3,248,957. What gene is this in and what is the effect on the protein sequence?

    Gene is infB, mutation is a snp

  •  Coordinate 3,894,997. What type of mutation is this?

    Deletion of the rbsD gene

  •  Check out the rbsA gene region? What's going on here?

    There was a large deletion. Can you figure out the exact coordinates of the endpoints?

  •  Navigate to coordinate 3,289,962. Compare the results for different alignment programs and settings. Can you explain what's going on here?

    There is a 16 base deletion in the gltB gene reading frame.

  •  What is going on in the pykF gene region? You might see red read pairs. What does that mean? Can you guess what type of mutation occurred here?

    The read pairs are discordantly mapped. There was an insertion of a new copy of a mobile genetic element (an IS150 element) that exists at other locations in the reference sequence.

  • See if you can find more interesting locations. Recall in the IGV tutorial we had ~190 mutations, here we have ~40.

In addition to highlighting the power of the statistical testing going on under the hood of breseq to separate the signal from the noise the concise html output to visualize the mutations, hopefully you also see how you can answer the same questions even easier with breseq.

Additional tutorials dealing with breseq

Installing breseq

Advanced breseq and gdtools

Error correction

Additional information on analyzing the output

  • Deatherage, D.E.Barrick, J.E.. (2014) Identification of mutations in laboratory-evolved microbes from next-generation sequencing data using breseqMethods Mol. Biol. 1151:165-188. «PubMed»

This tutorial was substantially reformatted from the most recent version found here. Our thanks to the previous instructors.

Return to the GVA2020 page