Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

Code Block
languagebash
titlebreseq command
cd $SCRATCH/GVA_breseq_lambda_mixed_pop
breseq -j 68 -r lambda.gbk lambda_mixed_population.fastq

While breseq is running lets look at what the different parts of the command are actually doing:

partpuprose
-j 68use 68 processors (recall we saw this was best in the rough testing using of bowtie2
-r lambda.gbkUse the lambda.gbk file as the reference to identify specific mutations
lambda_mixed_population.fastqbreseq assumes any argument not preceded by a - option to be an input fastq file to be used for mapping

This is the The absolute minimal command that breseq can do anything with : is a reference file and a fastq file. We've added the -j option to use more processors to speed things up a bit. When you executed the command without any options you saw more options and if you use breseq --help you will see more still. This will finish very quickly (less than 1 minute5 minutes) with a final line of "+++   SUCCESSFULLY COMPLETED". If you instead see something different as the last line before getting your prompt back, get my attention.

...

breseq produced a lot of directories beginning 01_sequence_conversion02_reference_alignment, ... Each of these contains intermediate files that are used to 'pickup where it left off' if the run doesn't complete successfully. These can be deleted when the run completes, or explored if you are interested in the inner guts of what is going on. More importantly, breseq will also produce two directories called: data and output which contain 1. files used to create .html output files and 2. actual .html output files respectively. The most interesting files are the .html files which can't be viewed directly on lonestarstampede2. Therefore we first need to copy the output directory back to your desktop computer. Use scp to transfer the contents of the output directory back to your local computer.

Remember this tutorial is available if you need help transferring files. 

Code Block
languagebash
titlesuggested compression command to prepare a single compressed. directory for transfer. This is similar to what we used for the IGV tutorial
cd $SCRATCH/GVA_breseq_lambda_mixed_pop
tar -czvf output.tar.gz output  # the czvf options in order mean Create, Zip, Verbose, Force
pwd # helpful for the Remote (Right) terminal window

...

Since we know it was a mixed population we can actually rerun the same fastq files against the same reference and add a flag for polymorphism mode (-p) and see what the difference in results is when we tell breseq that variants may exist at any frequency.

...

As this command takes a little over 5 minutes to complete, you may want to go back and look more at the consensus results you just transferred back, or forward and see what comes next.

 Data, and running breseq

Code Block
languagebash
titleCommands to copy the input data from the first breseq run to a new folder, and rerun breseq on the same fastq and reference file in polymorphism mode. Since this copy command is between 2 scratch locations i doubt there will be issues with it, but remember to restart an idev node if you experience difficulties
mkdir $SCRATCH/GVA_breseq_lambda_mixed_pop_polymode
cp $SCRATCH/GVA_breseq_lambda_mixed_pop/lambda* $SCRATCH/GVA_breseq_lambda_mixed_pop_polymode
cd $SCRATCH/GVA_breseq_lambda_mixed_pop_polymode
breseq -j 68 -p -r lambda.gbk lambda_mixed_population.fastq

...

Again you will need to transfer files back to your local computer to visualize the differences. The same exact compression command will work as the folder name is the same. In doing so you need to be careful where you transfer that file to on your local computer such that you don't overwrite the previously transferred files. Maybe add a _polymode to the directory you are transferring to as we did in our command above. Again help with SCP can be found here.

Code Block
languagebash
titlesuggested compression command to prepare a single compressed. directory for transfer. This is similar to what we used for the IGV tutorial
tar -czvf output.tar.gz output  # the czvf options in order mean Create, Zip, Verbose, Force
pwd # helpful for the Remote (Right) terminal window

...

As a reminder, the read files we were working with in the bowtie2 and SNV tutorials were originally downloaded from the NCBI Sequence Read Archive via  or via the corresponding European Nucleotide Archive record. They are Illumina Genome Analyzer sequencing of a paired-end library from a (haploid) E. coli clone that was isolated from a population of bacteria that had evolved for 20,000 generations in the laboratory as part of a long-term evolution experiment (Barrick et al, 2009). The reference genome is the ancestor of this E. coli population (strain REL606), so we expect the read sample to have differences from this reference that correspond to mutations that arose during the evolution experiment. If that description sounds like what breseq was made for ... breseq was literally developed at least in part to anlyze analyze this data.

...

data

...

Running this data set yourself is not actually recommended, or at least not recommended until you have completed some of the advanced breseq tutorials for 3 reasons:

  1. At best, this data set expected to take ~1 hour to complete.
  2. Version conflicts with bowtie2 between TACC and breseq prevent accessing the 'best' time.
  3. Potential issue with the launcher module.

3 can be worked around fairly simply as can be seen in the multiqc tutorial 2 will be handled in breseq installation tutorial after which the data will still take ~1hr.

...

Like we did earlier we'll start by downloading our reads and reference into a new folder on scratch. As the analysis in the rest of this tutorial will focus on the job queue system rather than analyzing interactively in an idev session, you should logout of any idev session you may be in now.

Code Block
languagebash
titleRemember that to copy an entire folder requires the use of the recursive (-r) option.
collapsetrue
cds
cp -r $BI/gva_course/mapping/data GVA_breseq_comparison_to_bowtieSamtoolsTutorials
cd GVA_breseq_comparison_to_bowtieSamtoolsTutorials
ls

The breseq_commands file

As noted above, the run will still take ~1 hour to run. Making this a great opportunity to use the job queue system rather than interactively on an idev session. As we will discuss on Friday, the job queue system is the common way you will use TACC to analyze your own data. For now we will be ignoring the theory and practice of how to use the system and instead focus on getting our analysis into the queue system so it can run overnight before tomorrow's class. As you will be reminded it is recommended that you start Thursday's class with evaluating this job submission in case you choose to make use of it during Thursday's class. 

Once you have downloaded the compressed output to your local computer you can jump down to evaluating it, or read through the commands I used for the run and some more explanation of what is going on.

Code Block
languagebash
title
partial scp command to copy to the current directory of your local computer
scp <user>@ls5.tacc.utexas.edu:/corral-repl/utexas/BioITeam/ngs_course/lambda_mixed_pop/breseq_SRR030257_run_output_folder.tar.gz .

data

...

As mentioned early in the course, some programs can actually take compressed fastq files in as input and breseq is 1 such program. In this example, breseq takes 2 fastq files, 1 as a non-compressed file, the other as a gzipped file. The command we want to run is:
breseq -j 68 -r NC_012967.1.gbk SRR030257_1.fastq SRR030257_2.fastq.gz

Because we are going to use the job queue system, rather than entering the above command into the terminal prompt, instead copy paste the command into a new file named "breseq_commands" using nano. Recall from the introduction tutorial:

Code Block
languagebash
titleRemember that to copy an entire folder requires the use of the recursive (-r) option.
collapsetrue
cds
cp -r $BI/gva_course/mapping/data GVA_breseq_comparison_to_bowtieSamtoolsTutorials
cd GVA_breseq_comparison_to_bowtieSamtoolsTutorials
ls

Running breseq

Code Block
languagebash
titlebreseq commandHow to create a new file named breseq_commands and start editing it in the nano text editor
nano breseq_commands

After you paste the breseq command into the editor, you can save the file using the control options listed at the bottom of the editor. Specifically, you'll want:

  • ctl-o - to save (aka write) the file 
  • ctl-x - exit nano


 Be sure that cat breseq_commands gives the following output exactly:

No Format
breseq -j 4868 -r NC_012967.1.gbk SRR030257_1.fastq SRR030257_2.fastq.gz

As mentioned early in the course, some programs can actually take compressed fastq files in as input and breseq is 1 such program. In the above example, it actually takes 2 fastq files in, 1 as a non-compressed file, the other as a gzipped file. The other thing we have introduced is the -j 48 option which should allow breseq to access 48 processors. Unfortunately if you have not completed the breseq installation tutorial, there is a version conflict between the module version of bowtie2 and breseq and will therefore throw an error when breseq tries to call bowtie2.

As noted above, the run will still take ~1 hour even after you have installed your own version of breseq and therefore it is recommended that you use the launcher_creator.py script to send this job to the queue after you have completed the installation tutorial.

...


...

The .slurm file

The job queue system takes 2 things:

  1. A commands file (like the one we just created)
  2. A slurm file with instructions of how the commands file is to be run

I've gone ahead and provided you a generic slurm file in the gva course portion of the BioITream. Copy it to your current directory.

Code Block
languagebash
titlecopy command
collapsetrue
cp /corral-repl/utexas/BioITeam/gva_course/GVA2021.launcher.slurm .

As the file provided is generic you will need to make a few changes to the file. Note that most of these lines have additional text to the right of the line. This commented text is present to help remind you what goes on each line, leaving it alone will not hurt anything, removing it may make it more difficult for you to remember what the purpose of the line is

Line numberAs isTo be
16

#SBATCH -J jobName

#SBATCH -J D3Breseq
21

#SBATCH -t 12:00:00

#SBATCH -t 4:00:00

22

##SBATCH --mail-user=ADD

#SBATCH --mail-user=<YourEmailAddress>

23

##SBATCH --mail-type=all

#SBATCH --mail-type=all

27

conda activate GVA2021

conda activate GVA-breseq

31

export LAUNCHER_JOB_FILE=commands

export LAUNCHER_JOB_FILE=breseq_commands

The changes to lines 22 and 23 are optional but will give you an idea of what types of email you could expect from TACC if you choose to use these options. Just be sure to pay attention to these 2 lines starting with a single # symbol after editing them.

Line 27 assumes that you have installed breseq into an environment named "GVA-breseq" earlier in this tutorial. If you used a different name, you will need to substitute it here. 

Again use ctl-o and ctl-x to save the file and exit.

Submitting the job

Now that we have the 2 things that the job queue system needs (commands file and slurm file to control the computer), all that is left is to submit the job using the sbatch command.

Code Block
languagebash
titlesbatch command for submitting your job
sbatch GVA2021.launcher.slurm

Your output should be similar to this:

No Format
-----------------------------------------------------------------
          Welcome to the Stampede2 Supercomputer                 
-----------------------------------------------------------------

No reservation for this job
--> Verifying valid submit host (login4)...OK
--> Verifying valid jobname...OK
--> Enforcing max jobs per user...OK
--> Verifying availability of your home dir (/home1/0004/train402)...OK
--> Verifying availability of your work2 dir (/work2/0004/train402/stampede2)...OK
--> Verifying availability of your scratch dir (/scratch/0004/train402)...OK
--> Verifying valid ssh keys...OK
--> Verifying access to desired queue (normal)...OK
--> Verifying job request is within current queue limits...OK
--> Checking available allocation (UT-2015-05-18)...OK
--> Verifying that quota for filesystem /home1/0004/train402 is at 29.45% allocated...OK
--> Verifying that quota for filesystem /work2/0004/train402/stampede2 is at  0.01% allocated...OK
Submitted batch job 7909016

Checking your submitted job

The showq command can be used to check the status of any jobs you have submitted. Executing the command without any options will show you the status of all jobs currently on stampede2 (which can be interesting at least once to get a sense of the volume of work that TACC deals with). More useful is to provie the '-u' option so that you only see jobs that are related to your userid.

Initially your job will be listed in the "WAITING JOBS------------------------" section, though depending on how busy stampede2 is when you submit your job, it may move directly to the "ACTIVE JOBS--------------------" section. Once the breseq command finishes running, the job will move down into a 3rd section labled something similar to "COMPLETING JOBS--------------------".

  • If you do not see your job listed just after you tried to submit it, it is likely that there was some kind of error with your submission. Get my attention for help troubleshooting what is going on.
  • If it has been some time since your job submission (and especially if you saw it listed in the active job section), the job may have completed. Check for the expected breseq output using the ls command. If you entered your email address, you may have some emails providing additional information about this job.
    • You will likely want to start Thursday's class by performing these checks as if you are having difficulties interacting with the job queue system it may cause problems during Thursday's class depending on what tutorials you decide to do.


Note
titleRecall that the the first part of Wednesday's class is devoted to data visualization, and comparing these results to the IGV tutorial results

Rather than waiting a few hours for the breseq analysis to finish, you can copy the expected results of the analysis to your local computer and evaluate them now. Alternatively, you can wait until your analysis is complete and finish this tutorial then.

For those wishing to do the comparison now:

Code Block
languagebash
titlepartial scp command to copy to the current directory of your local computer
scp <user>@stampede2.tacc.utexas.edu:/corral-repl/utexas/BioITeam/ngs_course/lambda_mixed_pop/breseq_SRR030257_run_output_folder.tar.gz .

The scp tutorial is available if you need additional assistance copying these files back to your computer.

evaluating output

Here are the comments from the IGV tutorial evaluating the data:

Expand
titleSome interesting example coordinates
  • Expand
    titleCoordinate 161,041. What gene is this in and what is the effect on the protein sequence?

    Gene is pcnB, mutation is a snp

  • Expand
    titleCoordinate 3,248,957. What gene is this in and what is the effect on the protein sequence?

    Gene is infB, mutation is a snp

  • Expand
    titleCoordinate 3,894,997. What type of mutation is this?

    Deletion of the rbsD gene

  • Expand
    titleCheck out the rbsA gene region? What's going on here?

    There was a large deletion. Can you figure out the exact coordinates of the endpoints?

  • Expand
    titleNavigate to coordinate 3,289,962. Compare the results for different alignment programs and settings. Can you explain what's going on here?

    There is a 16 base deletion in the gltB gene reading frame.

  • Expand
    titleWhat is going on in the pykF gene region? You might see red read pairs. What does that mean? Can you guess what type of mutation occurred here?

    The read pairs are discordantly mapped. There was an insertion of a new copy of a mobile genetic element (an IS150 element) that exists at other locations in the reference sequence.

  • See how the breseq results compare, and see if you can find more interesting locations. Recall in the IGV tutorial we had ~190 mutations, here we have ~40.

...

Additional tutorials dealing with breseq

Installing breseq

Advanced breseq and gdtools

Error correction

Additional information on analyzing the output

...

This tutorial was substantially reformatted from the most recent version found here. Our thanks to the previous instructors.

Return to the GVA2020 GVA2021 page