Advanced breseq GVA2019

Introduction

We analyzed a single lambda phage evolution experiment as both an introduction to variant visualization in the first part of Wednesday's class, but breseq is designed to be able to with "small" genomes, not just "tiny" genomes. It also has more advanced tools for visualizing mutations across multiple genomes and more complete identification of all mutations.

Objectives

  • Run breseq on 7 different E. coli clones evolved for 40,000 generations
  • Use the packaged gdtools commands to generate a comparison table of all mutations from all 7 samples
  • Learn shortcuts for compressing, transfering, and extracting multiple folders/files with single commands
  • Visualize the data off of TACC

Table of Contents

Setting up the run

The data files for this example are in the following path: $BI/ngs_course/ecoli_clones/data/ go ahead and copy them to a new folder in your $SCRATCH directory called GVA_breseq_multi-sample:

 By this point in the class you should have a real good idea of how to do this, but click here if you need more help
location of data files
cds
mkdir GVA_breseq_multi-sample
cd GVA_breseq_multi-sample
cp $BI/ngs_course/ecoli_clones/data/* .

if everything worked correctly, you should see the following files. We've provided a bit more context to what those files actually are:

File Name

Description

Sample

SRR030252_1.fastq SRR030252_2.fastq

Paired-end Illumina 36-bp reads

0K generation evolved E. coli strain

SRR030253_1.fastq SRR030253_2.fastq

Paired-end Illumina 36-bp reads

2K generation evolved E. coli strain

SRR030254_1.fastq SRR030254_2.fastq

Paired-end Illumina 36-bp reads

5K generation evolved E. coli strain

SRR030255_1.fastq SRR030255_2.fastq

Paired-end Illumina 36-bp reads

10K generation evolved E. coli strain

SRR030256_1.fastq SRR030256_2.fastq

Paired-end Illumina 36-bp reads

15K generation evolved E. coli strain

SRR030257_1.fastq SRR030257_2.fastq

Paired-end Illumina 36-bp reads

20K generation evolved E. coli strain

SRR030258_1.fastq SRR030258_2.fastq

Paired-end Illumina 36-bp reads

40K generation evolved E. coli strain

NC_012967.1.fasta

Reference Genome

E. coli B str. REL606

breseq will take a little longer to run on these sequences, so this give us an opportunity to run several commands at the same time making use of the multiple cores on a single processor. You'll want each command (line) in the commands file to look something like this:

breseq -r NC_012967.1.gbk -o output_<XX>K SRR030252_1.fastq SRR030252_2.fastq &> <XX>K.log.txt

Notice we've added some additional options:

partpuprose
&> <XX>00K.log.txtRedirect both the standard output and the standard error streams to a file called <XX>00k.log.txt. It is important that you replace the <XX> to send it to different files, but KEEP the &> as those are telling the command line to send the streams to that file.
-o output_<xx>00kall of those output directories should be put in the specified directory, instead of the current directory. If we don't include this (and change the <XX>), then we will end up writing the output from all of the runs on top of one other. The program will undoubtedly get confused, possibly crash, and generally it will be a mess.
&run the preceding command in the background. This is required so all the commands will run at once
 Click here for commands solution

Normally, we would use all of our data and use both R1 and R2 data, but to speed things up for the class, we will only use R1 data. Put the following commands into a new file called "commands" using nano.

Example commands file
breseq -r NC_012967.1.gbk -o output_00K SRR030252_1.fastq &> 00K.log.txt &
breseq -r NC_012967.1.gbk -o output_02K SRR030253_1.fastq &> 02K.log.txt &
breseq -r NC_012967.1.gbk -o output_05K SRR030254_1.fastq &> 05K.log.txt &
breseq -r NC_012967.1.gbk -o output_10K SRR030255_1.fastq &> 10K.log.txt &
breseq -r NC_012967.1.gbk -o output_15K SRR030256_1.fastq &> 15K.log.txt &
breseq -r NC_012967.1.gbk -o output_20K SRR030257_1.fastq &> 20K.log.txt &
breseq -r NC_012967.1.gbk -o output_40K SRR030258_1.fastq &> 40K.log.txt &

If you are especially interested in seeing a 'full' data set, you can use the following commands, but the runs will take longer as they have more reads associated with them.

Example commands file
breseq -r NC_012967.1.gbk -o output_00K SRR030252_1.fastq SRR030252_2.fastq &> 00K.log.txt &
breseq -r NC_012967.1.gbk -o output_02K SRR030253_1.fastq SRR030253_2.fastq &> 02K.log.txt &
breseq -r NC_012967.1.gbk -o output_05K SRR030254_1.fastq SRR030254_2.fastq &> 05K.log.txt &
breseq -r NC_012967.1.gbk -o output_10K SRR030255_1.fastq SRR030255_2.fastq &> 10K.log.txt &
breseq -r NC_012967.1.gbk -o output_15K SRR030256_1.fastq SRR030256_2.fastq &> 15K.log.txt &
breseq -r NC_012967.1.gbk -o output_20K SRR030257_1.fastq SRR030257_2.fastq &> 20K.log.txt &
breseq -r NC_012967.1.gbk -o output_40K SRR030258_1.fastq SRR030258_2.fastq &> 40K.log.txt &
 Bonus question ... why did we use -j 6?
  1. Each lonestar5 compute node has 48 processors available.
  2. We have 7 samples to run, so by requesting 6 processors, we allow all 7 samples to start at the same time leaving us with 6 unused processors.
  3. If we had requested 7 processors for each sample, only 6 samples would start initially and the 7th would start after the first finishes.
  4. These are similar to the considerations you have to make with the job submission system we will go over at the end of tomorrow's class.


Executing the run:

how to execute all the commands at once
chmod +x commands  # makes the commands file executable
./commands  # executes the commands file. the './' portion tells the command line to execute a file from the current directory as it is not in your $PATH variable.

Since we started running all of the breseq runs in the background, and are redirecting all of the output to log files, we need to make sure that things are actually running as we intended. The following 2 commands are good ways to check on this:

2 commands to check everything is running
ps  # the process command will return information on what processes are currently running. You should see 7 breseq commands
tail *.log.txt  # this should show you output that does not look like an error.

Even with just using the R1 data, this will take several minutes to complete (~40 minutes). While it is running, you should move onto reading through the next section learning how you will create comparison tables to easily visualize the data across the different samples. If you have all that figured out you may want to start one of the other tutorials while the above commands finish. Just remember to check back periodically as the 'completion' notifications of background jobs can be easy to miss.

Using gdtools to compare samples.

Assuming everything is running as it should (you should ask if you are not sure), we need to start exploring the gdtools commands. gdtools is a series of commands designed to function on "genome diff" (.gd) files (which are the main command line type of breseq command line analysis) that come packaged with breseq. As we have seen with numerous other programs throughout the course typing gdtools by itself will return a list of options. In the case of gdtools, it returns the list of subcommands that are available. Each has its own use, but for the purpose of this tutorial we are interested in comparing the output of multiple samples to one another. 

 What sub command do you think will accomplish this?

# from the gdtools output we see that:

ANNOTATE (or COMPARE)  annotate the effects of mutations and compare multiple samples
 What information do we need to give to this command to generate the output we want (hints inside)?
 Need a hint?

We identified gdtools compare as being the command and subcommand that we wanted to run to generate the comparison. Try typing just that to get a list of expected arguments.

 Click for solution

gdtools ANNOTATE/COMPARE [-o annotated.html] -r reference.gbk input.1.gd [input.2.gd ... ]

Things listed inside of "[" and "]" marks are optional arguments while things not inside of brackets are required.

In our command we sent the output of each sample to its own folder using the -o output_<XX>K portion of the command line. Among the multiple directories that breseq created within each output directory, you will find one named output which contains a file named output.gd. These are the files that we need to feed into the gdtools compare command (along with the NC_012967.1.gbk reference file) to generate the comparison table we want. In this course we have mentioned (though not emphasized) that using consitant nomenclature is important which it is, but now we come to a point where the consistancy has created a problem. Because all of the files are named "output.gd" we can neither move all of the commands into a single folder (they would overwrite one another as they have the same name) or the resulting table would display sample IDs as "output" "output" "output" ... not the most helpful information to say the least.

To circumvent that we will need to rename each file to something more meaningful (say the name of the directory that the output/output.gd file was found in (aka the sample name)). While this command can be run on TACC we'll wait until we transfer the files back to our local machine and do it there.

Visualizing output

Can you figure out how to archive all of the output directories and copy only those files (and not all of the very large intermediate files) back to your machine? - without deleting any files?

 Click here for a hint without the answer

You will want to use the tar command again, but you will need to create specific tar.gz files for each output directory with a different name.

click here to check your solution, or get the answer
tar -cvzf output00K.tar.gz output_00K/output

# etc ... read on

For 7 samples this is certainly doable, what happens when you have 1,000 samples? If only this was a big enough problem for enough researchers that someone stepped up and wrote a small script to do this in an automated way... Oh wait there is the BioITeam, and within the bin directory you will find a script named export-breseq.sh ... Normally this script assumes that all samples output were directed to a single output folder before being named based on the specific samples. In other words, rather than using -o output_<XX> as the output directory, we would use -o breseq_output/output_<XX>so all of the output folders are in a single subdirectory. The export-breseq.sh command will still work, but the output will end up 1 directory higher than where you started ... in this case, it will end up in the $SCRATCH directory.

Simplified way of exporting all output directories
cds
cd GVA_breseq_multi-sample
export-breseq.sh
mv ../05_Output* .

Now lets transfer the 05_Output_Export.tar.gz file back to our local computer

 Here are the commands we showed you for the previous example (with the trick of getting a single compressed output directory you just learned) to transfer so you don't have to scroll back and forth. See if you can remember how to do it without going back over the lesson.

To use scp you will need to run it in a terminal that is on your desktop and not on the remote TACC system. It can be tricky to figure out where the files are on the remote TACC system, because your desktop won't understand what $HOME, $WORK, $SCRATCH mean (they are only defined on TACC).

To figure out the full path to your file, you can use the pwd command in your terminal on TACC in the window that you ran breseq in (it should contain an "output" folder). Rather than copying the entire contents of the folder which can be rather large, we are going to add a twist of compressing the entire folder into a single compressed archive using the tar command so that the size will be smaller and it will transfer faster:

Command to type in TACC
pwd

Then you can then copy paste that information (in the correct position) into the scp command on the desktop's command line:

Command to type in the desktop's terminal window
cd Desktop
scp -r <username>@ls5.tacc.utexas.edu:<the_directory_returned_by_pwd>/05_Output_Export.tar.gz .
tar -xvzf 05_Output_Export.tar.gz
cd 05_Output_Export

Now you will see 7 .tar.gz files in the directory, and you can extract each of them 1 at a time by using the tar -xvzf <FILE> command. Again this seems acceptable for 7 samples but not when dealing with 100s of samples. The 1 liner below will extract .tar.gz files sequentially:

for f in *.tar.gz;do tar xvzf $f;done

This command will extract all files ending in .tar.gz regardless of their origin. As such the following 2 commands might be of use to you:

echo "for f in *.tar.gz;do tar xvzf $f;done" > detar
chmod +x detar

Now you have a file named detar that executes the command without having to copy paste it. The only thing left is to move it to somewhere in your path. This will work on either your personal computer or TACC the same way.

Now you can click through each individual sample's output to see the different mutations in any given sample, but the whole point of this tutorial has been to look at multiple samples at once.

Generating compare tables.

As mentioned above, we need to change the names of the output.gd files to something more informative. You can manually do this by going into each sample directory, each output directory, and using the cp command to rename the output.gd file to something more informative.

cd output_00K_output
cp output.gd 00K_output.gd# ETC

Once again there is an easier way to do this all at once while generating a new directory to hold all of the outputs:

# navigate to the directory containing all of the extracted output directories: the ls command should return the following folders: 
# output_00K_output output_05K_output output_15K_output output_40K_output output_02K_output output_10K_output output_20K_output
mkdir gds
for f in *_output; do n=$(echo $f|sed 's/_output//'); cp $f/output.gd gds/$n.gd;done
cd gds

Now that we have nicely named .gd files (the ls command can confirm this), its time to use the gdtools compare command. As your local computer doesn't have breseq on it we will need to move these .gd files back to TACC so we can access the gdtools commands. This is done on purpose to give you another example of how to transfer files from your local computer to TACC (something you will do almost as frequently as the inverse with your own work though not so in this class).

The SCP command functionality works the same way when transfering to TACC, you just have to make sure you supply your user name on the TACC part of the address in this case the destination
# On TACC:
cd $SCRATCH/GVA_breseq_multi-sample
mkdir comparison_table
cd comparison_table
pwd
 
# On your local computer:
scp *.gd <username>@ls5.tacc.utexas.edu:<the_directory_returned_by_pwd>

Now working again on TACC. Earlier we determined the formatting for the gdtools compare command. Either of the following commands will work. Compare the 2 and decide which you think is better:

gdtools compare -r ../NC_012967.1.gbk output_00K.gd output_02K.gd output_05K.gd output_10K.gd output_15K.gd output_20K.gd output_40K.gd
# OR more simply...
gdtools compare -r ../NC_012967.1.gbk *.gd 

Now transfer the new annotated.html file back to your local computer.

The SCP command functionality works the same way when transfering to TACC, you just have to make sure you supply your user name on the TACC part of the address in this case the destination
# On TACC:
pwd
 
# On your local computer:
scp <username>@ls5.tacc.utexas.edu:<the_directory_returned_by_pwd>/annotated.html .

Open the annotated.html and examine the results, does this make sense with what you would expected from an evolution experiment?

Back to the GVA2019 page