Advanced breseq GVA2017
Introduction
We analyzed a single lambda phage evolution experiment as both an introduction to variant visualization in the first part of Wednesday's class, but breseq is designed to be able to with "small" genomes, not just "tiny" genomes. It also has more advanced tools for visualizing mutations across multiple genomes and more complete identification of all mutations.
Objectives
- Run breseq on 7 different E. coli clones evolved for 40,000 generations
- Use the packaged gdtools commands to generate a comparison table of all mutations from all 7 samples
- Learn shortcuts for compressing, transfering, and extracting multiple folders/files with single commands
- Visualize the data off of TACC
Table of Contents
Setting up the run
The data files for this example are in the following path: $BI/ngs_course/ecoli_clones/data/ go ahead and copy them to a new folder in your $SCRATCH directory called BDIB_breseq_multi-sample
:
if everything worked correctly, you should see the following files. We've provided a bit more context to what those files actually are:
File Name | Description | Sample |
---|---|---|
| Paired-end Illumina 36-bp reads | 0K generation evolved E. coli strain |
| Paired-end Illumina 36-bp reads | 2K generation evolved E. coli strain |
| Paired-end Illumina 36-bp reads | 5K generation evolved E. coli strain |
| Paired-end Illumina 36-bp reads | 10K generation evolved E. coli strain |
| Paired-end Illumina 36-bp reads | 15K generation evolved E. coli strain |
| Paired-end Illumina 36-bp reads | 20K generation evolved E. coli strain |
| Paired-end Illumina 36-bp reads | 40K generation evolved E. coli strain |
| Reference Genome | E. coli B str. REL606 |
breseq will take a little longer to run on these sequences, so this give us an opportunity to run several commands at the same time making use of the multiple cores on a single processor. You'll want each command (line) in the commands file to look something like this:
breseq -j 6 -r NC_012967.1.gbk -o output_<XX>K SRR030252_1.fastq SRR030252_2.fastq &> <XX>K.log.txt
Notice we've added some additional options:
part | puprose |
---|---|
&> <XX>00K.log.txt | Redirect both the standard output and the standard error streams to a file called <XX>00k.log.txt. It is important that you replace the <XX> to send it to different files, but KEEP the &> as those are telling the command line to send the streams to that file. |
-o output_<xx>00k | all of those output directories should be put in the specified directory, instead of the current directory. If we don't include this (and change the <XX>), then we will end up writing the output from all of the runs on top of one other. The program will undoubtedly get confused, possibly crash, and generally it will be a mess. |
& | run the preceding command in the background. This is required so all the commands will run at once |
Executing the run:
chmod +x commands # makes the commands file executable module unload samtools ./commands # executes the commands file. the './' portion tells the command line to execute a file from the current directory as it is not in your $PATH variable.
Since we started running all of the breseq runs in the background, and are redirecting all of the output to log files, we need to make sure that things are actually running as we intended. The following 2 commands are good ways to check on this:
Even with just using the R1 data, this will take several minutes to complete (~40 minutes). While it is running, you should move onto reading through the next section learning how you will create comparison tables to easily visualize the data across the different samples. If you have all that figured out you may want to start ne of the other tutorials while the above commands finish. Just remember to check back periodically as the 'completion' notifications of background jobs can be easy to miss.
Using gdtools to compare samples.
Assuming everything is running as it should (you should ask if you are not sure), we need to start exploring the gdtools commands. gdtools is a series of commands designed to function on "genome diff" (.gd) files (which are the main command line type of breseq command line analysis) that come packaged with breseq. As we have seen with numerous other programs throughout the course typing gdtools by itself will return a list of options. In the case of gdtools, it returns the list of subcommands that are available. Each has its own use, but for the purpose of this tutorial we are interested in comparing the output of multiple samples to one another.