2022 Running batch jobs at TACC
Reservations
Use our summer school reservation (CoreNGSday2) when submitting batch jobs, or when requesting an idev session, to get higher priority on the ls6 normal queue today:
# submit a batch job using the reservation
sbatch --reservation=CoreNGSday2 <batch_file>.slurm

# start an interactive (idev) session using the reservation
idev -m 180 -N 1 -A OTH21164 -r CoreNGSday2
Note that the reservation name (CoreNGSday2) is different from the TACC allocation/project for this class, which is OTH21164.
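If you want to double-check the reservation (for example, its active time window), SLURM's scontrol command can display it. This is just a quick sanity check using the reservation name above:
Check the class reservation
# show the reservation's details, including its start/end times
scontrol show reservation CoreNGSday2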
- 1 Compute cluster overview
- 2 Software at TACC
- 3 Job Execution
- 3.1 SLURM at a glance
- 3.2 Simple example
- 3.2.1 Copy simple commands
- 3.2.2 View simple commands
- 3.2.3 Create batch submission script for simple commands
- 3.2.4 Submit simple job to batch queue
- 3.2.5 View one log file
- 3.2.6 Multi-character filename wildcarding
- 3.2.7 Single character filename wildcarding
- 3.2.8 An echo command
- 3.2.9 Backtick evaluation
- 3.3 Job parameters
- 3.3.1 launcher_creator.py
- 3.3.2 job name and commands file
- 3.3.3 queues and runtime
- 3.3.4 allocation and SUs
- 3.3.4.1 ALLOCATION setting in .bashrc
- 3.3.5 wayness (tasks per node)
- 3.4 Wayness example
- 4 Some best practices
- 5 Interactive sessions (idev)
Compute cluster overview
When you SSH into ls6, your session is assigned to one of a small set of login nodes (also called head nodes). These are not the cluster compute nodes that will run your jobs.
Think of a node as a computer, like your laptop, but probably with more cores and memory. Now multiply that computer by a thousand or more, and you have a cluster.
The small set of login nodes is a shared resource (type the users command to see everyone currently logged in) and is not meant for running interactive programs – for that, you submit a description of what you want done to a batch system, which farms the work out to one or more compute nodes.
On the other hand, the login nodes are intended for copying files to and from TACC, so they have a lot of network bandwidth while compute nodes have limited network bandwidth.
So follow these guidelines:
- Do not perform substantial computation on the login nodes.
  - They are closely monitored, and you will get warnings from the TACC admin folks!
  - Code is usually developed and tested somewhere other than TACC, and only moved over when pretty solid.
- Do not perform significant network access from your batch jobs.
  - Instead, stage your data from a login node onto $SCRATCH before submitting your job (a sketch of this is shown below).
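Here is a minimal sketch of such a staging step, run from a login node. The remote host, user name, and paths below are placeholders, not real course data:
Stage data onto $SCRATCH from a login node
# make a place for the input data on $SCRATCH
mkdir -p $SCRATCH/my_project/fastq
cd $SCRATCH/my_project/fastq

# copy data from a remote server (hypothetical host, user, and path)
rsync -avP my_user@my.remote.server.edu:/path/to/fastq/ .

# or copy a single file with scp
# scp my_user@my.remote.server.edu:/path/to/sample.fastq.gz .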
Lonestar6 and Stampede2 overview and comparison
Here is a comparison of the configurations of ls6 and stampede2. As you can see, stampede2 is the larger cluster, launched in 2017, while ls6, launched this year (2022), has fewer but more powerful nodes.
| | ls6 | stampede2 |
|---|---|---|
| login nodes | 3 (128 cores each) | 6 (28 cores each) |
| standard compute nodes | 560 AMD Epyc Milan processors, 128 cores per node | 4,200 KNL (Knights Landing) processors, 68 cores per node (272 virtual); 1,736 SKX (Skylake) processors |
| GPU nodes | 16 AMD Epyc Milan processors, 128 cores per node, 2x NVIDIA A100 GPUs | -- |
| batch system | SLURM | SLURM |
| maximum job run time | 48 hours, normal queue; 2 hours, development queue | 96 hours on KNL nodes, normal queue; 48 hours on SKX nodes, normal queue; 2 hours, development queue |
Note the use of the term virtual core above for stampede2. Compute cores are standalone processors – mini CPUs, each of which can execute separate sets of instructions. However, modern cores may also have hyper-threading enabled, where a single core appears as more than one virtual processor to the operating system (see https://en.wikipedia.org/wiki/Hyper-threading for more on hyper-threading). For example, stampede2 nodes have 2 or 4 hyperthreads (HTs) per core, so KNL nodes, with 4 HTs for each of their 68 physical cores, have a total of 272 virtual cores.
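If you're curious how many (virtual) cores the machine you're logged into reports, a couple of standard Linux commands will show you (the numbers you see depend on the node type you're on):
Count cores on the current machine
# number of processing units (virtual cores) visible to the OS
nproc

# more detail: sockets, cores per socket, and threads (hyperthreads) per core
lscpu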
User guides for ls6 and stampede2 can be found in the TACC documentation portal (look for the Lonestar6 and Stampede2 user guides).
Unfortunately, the TACC user guides are aimed at a different user community – the weather modelers and aerodynamic flow simulators who need very fast matrix manipulation and other high performance computing (HPC) features. The usage patterns for bioinformatics – generally running 3rd party tools on many different datasets – are rather a special case for HPC. TACC calls our type of processing "parameter sweep jobs" and has a special process for running them, using their launcher module.
Software at TACC
Programs and your $PATH
When you type in the name of an arbitrary program (ls for example), how does the shell know where to find that program? The answer is your $PATH. $PATH is a pre-defined environment variable whose value is a list of directories. The shell looks for program names in that list, in the order the directories appear.
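You can inspect your own $PATH at any time; printing one directory per line makes the search order easier to read (a quick sketch using standard shell tools):
View your $PATH
# the raw, colon-separated directory list
echo $PATH

# one directory per line, in the order they are searched
echo $PATH | tr ':' '\n'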
To determine where the shell will find a particular program, use the which command. Note that which tells you where it looked if it cannot find the program.
Using which to search $PATH
which rsync
which cat
which bwa    # not yet available to you
The module system
The module system is an incredibly powerful way to have literally thousands of software packages available, some of which are incompatible with each other, without causing complete havoc. The TACC staff stages packages in well-known locations that are NOT on your $PATH. Then, when a module is loaded, its binaries are added to your $PATH.
For example, the following module load command makes the matlab program available to you:
How module load affects $PATH
# first type "matlab" to show that it is not present in your environment:
matlab
# it's not on your $PATH either:
which matlab
# now add matlab to your environment and try again:
module load matlab
# and see how it's now on your $PATH:
which matlab
# you can see the new directory at the front of $PATH
echo $PATH
# to remove it, use "unload"
module unload matlab
matlab
# gone from $PATH again...
which matlab
TACC BioContainers modules
It is quite a large systems administration task to install software at TACC and configure it for the module system. As a result, TACC was always behind in making important bioinformatics software available. To address this problem, TACC moved to providing bioinformatics software via containers, which are similar to virtual machines (think VMware or VirtualBox) but lighter weight: they require less disk space because they rely more on the host's base Linux environment. Specifically, TACC (and many other high performance computing clusters) use Singularity containers, which are similar to Docker containers but better suited to the HPC environment (in fact, one can build a Docker container then easily convert it to Singularity for use at TACC).
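As an illustration of that Docker-to-Singularity path, pulling a public Docker image into a Singularity image file looks roughly like this. This is only a sketch – the image name is an arbitrary public example, and at TACC you would normally just use the BioContainers modules described below rather than pulling images yourself:
Convert a Docker image to a Singularity image (illustration)
# pull a Docker image and convert it to a local Singularity image file (.sif)
singularity pull ubuntu.sif docker://ubuntu:22.04

# run a command inside the resulting container
singularity exec ubuntu.sif cat /etc/os-release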
TACC obtains its containers from BioContainers (https://biocontainers.pro/ and https://github.com/BioContainers/containers), a large public repository of bioinformatics tool Singularity containers. This has allowed TACC to easily provision thousands of such tools!
These BioContainers are not visible in TACC's "standard" module system, but only after the master biocontainers module is loaded. Once it has been loaded, you can search for your favorite bioinformatics program using module spider.
# Verify that samtools is not available
samtools
# Load the Biocontainers master module (this takes a while)
module load biocontainers
# Now look for these programs
module spider samtools
module spider Rstats
module spider kallisto
module spider bowtie2
module spider minimap2
module spider multiqc
module spider GATK
module spider velvet
Notice how the BioContainers module names have "ctr" in their names, version numbers, and other identifying information.
The standard TACC module system has been phased out for bioinformatics programs, so always look for your application in BioContainers.
While it's great that there are now thousands of programs available through BioContainers, the one drawback is that they can only be run on cluster compute nodes, not on login nodes. To test a BioContainer program interactively, you will need to use TACC's idev command to obtain an interactive cluster node. More on this shortly...
loading a biocontainer module
Once the biocontainers module has been loaded, you can just module load the desired tool, as with the kallisto pseudo-aligner program below.
# Load the Biocontainers master module
module load biocontainers
# Verify kallisto is not yet available
kallisto
# Load the default kallisto biocontainer
module load kallisto
# Verify kallisto is now available (although it cannot be run on login nodes)
kallisto
Note that loading a BioContainer does not add anything to your $PATH. Instead, it defines an alias, which is just a shortcut for executing the command. You can see the alias definition using the type command, and you can ensure the program is available using the command -v utility.
# Note that kallisto has not been added to your $PATH, but instead has an alias
which kallisto
# Ensure kallisto is available with command -v
command -v kallisto
installing custom software
Even with all the tools available at TACC, inevitably you'll need something they don't have. In this case you can build the tool yourself and install it in a local TACC directory. While building 3rd party tools is beyond the scope of this course, it's really not that hard. The trick is keeping it all organized.
For one thing, remember that your $HOME directory quota is fairly small (10 GB on ls6), and that can fill up quickly if you install many programs. We recommend creating an installation area in your $WORK directory and installing programs there. You can then make symbolic links to the binaries you need in your $HOME/local/bin directory (which was added to your $PATH in your .bashrc).
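A minimal sketch of that layout might look like the following; the tool name mytool and its paths are placeholders for whatever you actually build or install:
Install under $WORK and link into $HOME/local/bin
# create an installation area in $WORK for a hypothetical tool
mkdir -p $WORK/install/mytool/bin
# ... build or copy the mytool binary into $WORK/install/mytool/bin ...

# make sure your personal bin directory exists, then link the binary there
mkdir -p $HOME/local/bin
ln -sf $WORK/install/mytool/bin/mytool $HOME/local/bin/mytool

# verify the shell now finds it (assuming $HOME/local/bin is on your $PATH)
which mytool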
See how we used a similar trick to make the launcher_creator.py program available to you. Using the ls -l option shows you where symbolic links point to:
Real location of launcher_creator.py
ls -l ~/local/bin
$PATH caveat
Remember that the order of locations in the $PATH environment variable is the order in which the locations will be searched. In particular, the (non-BioContainers) module load command adds to the front of your path. This can mask similarly-named programs, for example, in your $HOME/local/bin directory.
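If you suspect one program is masking another, you can ask the shell to list every match it would consider, in search order (a quick check; samtools here is just an example name):
See all matches for a program name, in $PATH order
# list every alias, function, or executable the shell would consider, in order
type -a samtools

# or list all matching executables found on $PATH
which -a samtools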
Job Execution
Job execution is controlled by the SLURM batch system on both stampede2 and ls6.
To run a job you prepare 2 files:
- a commands file containing the commands to run, one task per line (<job_name>.cmds)
- a job control file that describes how to run the job (<job_name>.slurm)
The process of running the job involves these steps:
1. Create a commands file containing exactly one task per line.
2. Prepare a job control file for the commands file that describes how the job should be run.
3. Submit the job control file to the batch system. The job is then said to be queued to run.
4. The batch system prioritizes the job based on the number of compute nodes needed and the job run time requested.
5. When compute nodes become available, the job tasks (command lines in the <job_name>.cmds file) are assigned to one or more compute nodes and begin to run in parallel.
The job completes when:
- you cancel the job manually
- all tasks in the job complete (successfully or not!)
- the requested job run time has expired
SLURM at a glance
Here are the main components of the SLURM batch system.
| | stampede2, ls6 |
|---|---|
| batch system | SLURM |
| batch control file name | <job_name>.slurm |
| job submission command | sbatch <job_name>.slurm |
| job monitoring command | showq -u |
| job stop command | scancel <job ID> |
Simple example
Let's go through a simple example. Execute the following commands to copy a pre-made simple.cmds commands file:
Copy simple commands
mkdir -p $SCRATCH/core_ngs/slurm/simple
cd $SCRATCH/core_ngs/slurm/simple
cp $CORENGS/tacc/simple.cmds .
What are the tasks we want to do? Each task corresponds to one line in the simple.cmds file, so let's take a look at it using the cat (concatenate) command that simply reads a file and writes each line of content to standard output (here, your Terminal):
View simple commands
cat simple.cmds
The tasks we want to perform look like this:
sleep 5; echo "Command 1 on `hostname` - `date`" > cmd1.log 2>&1
sleep 5; echo "Command 2 on `hostname` - `date`" > cmd2.log 2>&1
sleep 5; echo "Command 3 on `hostname` - `date`" > cmd3.log 2>&1
sleep 5; echo "Command 4 on `hostname` - `date`" > cmd4.log 2>&1
sleep 5; echo "Command 5 on `hostname` - `date`" > cmd5.log 2>&1
sleep 5; echo "Command 6 on `hostname` - `date`" > cmd6.log 2>&1
sleep 5; echo "Command 7 on `hostname` - `date`" > cmd7.log 2>&1
sleep 5; echo "Command 8 on `hostname` - `date`" > cmd8.log 2>&1There are 8 tasks. Each is a simple echo command that just outputs string containing the task number and date to a different file after sleeping for 5 seconds. Notice that we can put two commands on one line if they are separated by a semicolon ( ; ).
Use the handy launcher_creator.py program to create the job submission script.
Create batch submission script for simple commands
launcher_creator.py -j simple.cmds -n simple -t 00:01:00 -a OTH21164 -q normal
You should see output something like the following, and you should see a simple.slurm batch submission file in the current directory.
Project simple.
Using job file simple.cmds.
Using normal queue.
For 00:01:00 time.
Using OTH21164 allocation.
Not sending start/stop email.
Launcher successfully created. Type "sbatch simple.slurm" to queue your job.
Submit your batch job like this, then check the batch queue to see the job's status.
Submit simple job to batch queue
sbatch --reservation CoreNGSday2 simple.slurm
showq -u
# Output looks something like this:
-------------------------------------------------------------
Welcome to the Lonestar6 Supercomputer
-------------------------------------------------------------
--> Verifying valid submit host (login1)...OK
--> Verifying valid jobname...OK
--> Verifying valid ssh keys...OK
--> Verifying access to desired queue (normal)...OK
--> Checking available allocation (OTH21164)...OK
Submitted batch job 232542
The queue status will show your job as ACTIVE while it's running, or WAITING if not.
SUMMARY OF JOBS FOR USER: <abattenh>
ACTIVE JOBS--------------------
JOBID JOBNAME USERNAME STATE NODES REMAINING STARTTIME
================================================================================
232542 simple abattenh Running 1 0:00:54 Thu Jun 9 11:30:18
WAITING JOBS------------------------
JOBID JOBNAME USERNAME STATE NODES WCLIMIT QUEUETIME
================================================================================
Total Jobs: 1 Active Jobs: 1 Idle Jobs: 0 Blocked Jobs: 0
If you don't see your simple job in either the ACTIVE or WAITING sections of your queue, it probably already finished – it should only run for a few seconds!
Notice in my queue status, where the STATE is Running, there is only one node assigned. Why is this, since there were 8 tasks?
Every job, no matter how few tasks it requests, will be assigned at least one node. Each lonestar6 node has 128 physical cores, so each of the 8 tasks can be assigned to its own core.
Exercise: What files were created by your job?
filename wildcarding
You can look at one of the output log files like this:
View one log file
cat cmd1.log
But here's a cute trick for viewing the contents of all your output files at once, using the cat command and filename wildcarding.
Multi-character filename wildcarding
cat cmd*.log
The cat command actually takes a list of one or more files (if you're giving it files rather than standard input – more on this shortly) and outputs the concatenation of them to standard output. The asterisk ( * ) in cmd*.log is a multi-character wildcard that matches any filename starting with cmd then ending with .log. So it would match cmd_hello_world.log.
You can also specify single-character matches inside brackets ( [ ] ) in either of the ways below, this time using the ls command so you can better see what is matching:
Single character filename wildcarding
ls cmd[1234].log
ls cmd[2-6].log
This technique is sometimes called filename globbing, and the pattern a glob. Don't ask me why – it's a Unix thing. Globbing – translating a glob pattern into a list of files – is one of the handy things the bash shell does for you. (Read more about Wildcards and special filenames.)
Exercise: How would you list all files starting with simple?
Here's what my cat output looks like. Notice the times are all nearly the same because all the tasks ran in parallel. That's the power of cluster computing!
Command 1 on c305-005.ls6.tacc.utexas.edu - Thu Jun 9 11:30:39 CDT 2022
Command 2 on c305-005.ls6.tacc.utexas.edu - Thu Jun 9 11:30:33 CDT 2022
Command 3 on c305-005.ls6.tacc.utexas.edu - Thu Jun 9 11:30:33 CDT 2022
Command 4 on c305-005.ls6.tacc.utexas.edu - Thu Jun 9 11:30:38 CDT 2022
Command 5 on c305-005.ls6.tacc.utexas.edu - Thu Jun 9 11:30:40 CDT 2022
Command 6 on c305-005.ls6.tacc.utexas.edu - Thu Jun 9 11:30:38 CDT 2022
Command 7 on c305-005.ls6.tacc.utexas.edu - Thu Jun 9 11:30:33 CDT 2022
Command 8 on c305-005.ls6.tacc.utexas.edu - Thu Jun 9 11:30:39 CDT 2022
echo
Let's take a closer look at a typical task in the simple.cmds file.
An echo command
sleep 5; echo "Command 3 `date`" > cmd3.log 2>&1The echo command is like a print statement in the bash shell. Echo takes its arguments and writes them to one line of standard output. While not always required, it is a good idea to put echo's output string in double quotes.
backtick evaluation
So what is this funny looking `date` bit doing? Well, date is just another Linux command (try just typing it in). Here we don't want the shell to put the string "date" in the output, we want it to execute the date command and put the result text into the output. The backquotes ( ` ` also called backticks) around the date command tell the shell we want that command executed and its output substituted into the string. (Read more about Quoting in the shell.)
Backtick evaluation
# These are equivalent:
date
echo `date`
# But different from this:
echo date
output redirection
There's still more to learn from one of our simple tasks, something called output redirection:
sleep 5; echo "Command 3 `date`" > cmd3.log 2>&1Normally echo writes its string to standard output. If you invoke echo in an interactive shell like Terminal, standard output is displayed to the Terminal window.
All outputs generated by tasks in your batch job are directed to one output file and one error file per job. Here they have names like simple.o2916562 and simple.e2916562; simple.o2916562 contains all standard output and simple.e2916562 contains all standard error generated by your tasks that was not redirected elsewhere, as well as information relating to running your job and its tasks. For large jobs with complex tasks, it is not easy to troubleshoot execution problems using these files.
So we usually want to separate the outputs of all our tasks into individual log files, one per task. Why is this important? Suppose we run a job with 100 commands, each one a whole pipeline (alignment, for example). 88 finish fine but 12 do not. Just try figuring out which ones had the errors, and where the errors occurred, if all the standard output is in one intermingled file and all the standard error is in another intermingled file!
So in the above example the first '>' says to redirect the standard output of the echo command to the cmd3.log file. The '2>&1' part says to redirect standard error to the same place. Technically, it says to redirect standard error (built-in Linux stream 2) to the same place as standard output (built-in Linux stream 1); and since standard output is going to cmd3.log, any standard error will go there also. (Read more about Standard I/O streams.)
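Here is a small interactive illustration of those redirection forms. It's only a sketch – /no/such/file is simply a path that doesn't exist, used to force an error message on standard error:
Output redirection examples
# both standard output and standard error appear on your Terminal
ls $HOME /no/such/file

# standard output goes to out.log; the error still appears on the Terminal
ls $HOME /no/such/file > out.log

# standard output goes to out.log, and standard error goes to the same place
ls $HOME /no/such/file > out.log 2>&1

# or keep them separate: standard output to out.log, standard error to err.log
ls $HOME /no/such/file > out.log 2> err.log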
Job parameters
Now that we've executed a really simple job, let's take a look at some important job submission parameters. These correspond to arguments to the launcher_creator.py script.
A bit of background. Historically, TACC was set up to cater to researchers writing their own C or Fortran codes highly optimized to exploit parallelism (the HPC crowd). Much of TACC's documentation is aimed at this audience, which makes it difficult to pick out the important parts for us.
The kind of jobs we biologists generally run are relatively new to TACC. They even have a special name for them: "parametric sweeps", by which they mean the same program running on different data sets.
In fact there is a special software module required to run our jobs, called the launcher module. You don't need to worry about activating the launcher module – that's done by the <job_name>.slurm script created by launcher_creator.py like this:
module load launcher
The launcher module knows how to interpret various job parameters in the <job_name>.slurm SLURM batch submission script and use them to create your job and assign its tasks to compute nodes. Our launcher_creator.py program is a simple Python script that lets you specify job parameters and writes out a valid <job_name>.slurm submission script.
launcher_creator.py
If you call launcher_creator.py with no arguments it gives you its usage description. Because it is a long help message, we may want to pipe the output to more, a pager that displays one screen of text at a time. Press the spacebar to advance to the next page, and Ctrl-c to exit from more.
Get usage information for launcher_creator.py
# Use spacebar to page forward; Ctrl-c to exit
launcher_creator.py | more
launcher_creator.py usage
usage: launcher_creator.py [-h] -n NAME -t TIME_REQUEST [-j JOB_FILE]
[-b SHELL_COMMANDS] [-B SHELL_COMMANDS_FILE]
[-q QUEUE] [-a [ALLOCATION]] [-m MODULES]
[-M MODULES_FILE] [-w WAYNESS] [-N NUM_NODES]
[-e [EMAIL]] [-l LAUNCHER] [-s]
Create launchers for TACC clusters. Report problems to rt-
other@ccbb.utexas.edu
optional arguments:
-h, --help show this help message and exit
Required:
-n NAME, --name NAME The name of your job.
-t TIME_REQUEST, --time TIME_REQUEST
The time you want to give to your job. Format:
hh:mm:ss
Commands:
You must use at least one of these options to submit your commands for
TACC.
-j JOB_FILE, --jobs JOB_FILE
The name of the job file containing your commands.
-b SHELL_COMMANDS, --bash SHELL_COMMANDS
A string of shell (Bash, zsh, etc) commands that are
executed before any parametric jobs are launched.
-B SHELL_COMMANDS_FILE, --bash_file SHELL_COMMANDS_FILE
A file containing shell (Bash, zsh, etc) commands that
are executed before any parametric jobs are launched.
Optional:
-q QUEUE, --queue QUEUE
The TACC allocation for job submission.
Default="development"
-a [ALLOCATION], -A [ALLOCATION], --allocation [ALLOCATION]
The TACC allocation for job submission. You can set a
default ALLOCATION environment variable.
-m MODULES, --modules MODULES
A list of module commands. The "launcher" module is
always automatically included. Example: -m "module
swap intel gcc; module load bedtools"
-M MODULES_FILE, --modules_file MODULES_FILE
A file containing module commands.
-w WAYNESS, --wayness WAYNESS
Wayness: the number of commands you want to give each
node. The default is the number of cores per node.
-N NUM_NODES, --num_nodes NUM_NODES
Number of nodes to request. You probably don't need
this option. Use wayness instead. You ONLY need it if
you want to run a job list that isn't defined at the
time you submit the launcher.
-e [EMAIL], --email [EMAIL]
Your email address if you want to receive an email
from Lonestar when your job starts and ends. Without
an argument, it will use a default EMAIL_ADDRESS
environment variable.
-l LAUNCHER, --launcher_name LAUNCHER
The name of the launcher script that will be created.
Default="<name>.slurm"
-s                    Echoes the launcher filename to stdout.
job name and commands file
Recall how the simple.slurm batch file was created:
Create batch submission script for simple commands
launcher_creator.py -j simple.cmds -n simple -t 00:01:00 -a OTH21164 -q normal
- The name of your commands file is given with the -j simple.cmds argument.
- Your desired job name is given with the -n <job_name> argument.
  - The <job_name> (here simple) is the job name you will see in your queue.
- By default a corresponding <job_name>.slurm batch file is created for you.
  - It contains the name of the commands file that the batch system will execute.
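If you're curious what launcher_creator.py actually wrote out, just look at the generated file (its name follows from the -n simple argument above):
View the generated batch file
# page through the generated SLURM submission script
more simple.slurm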