...

  1. Create a commands file containing exactly one task per line.
  2. Prepare a job control file for the commands file that describes how the job should be run.
  3. Submit the job control file to the batch system.
    1. The job is then said to be queued to run.
  4. The batch system prioritizes the job based on the number of compute nodes needed and the job run time requested.
  5. When compute nodes become available, the job tasks (command lines in the <job_name>.cmds file) are assigned to one or more compute nodes and begin to run in parallel.
  6. The job completes when either:
    1. you cancel the job manually
    2. all job tasks complete (successfully or not!)
    3. the requested job run time has expired
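The "one task per line, run in parallel" idea in the steps above can be tried locally with a small sketch. This is only an illustration using ordinary background processes; it is not how the TACC launcher works, and the file names (tasks.cmds, task1.log, ...) are hypothetical.

```shell
# Local illustration only: this is NOT the TACC launcher, just background jobs.
workdir=$(mktemp -d)
cmds_file="$workdir/tasks.cmds"   # hypothetical commands file

# Step 1: write one task per line.
for i in 1 2 3 4; do
  echo "sleep 1; echo \"task $i done\" > $workdir/task$i.log"
done > "$cmds_file"

# Steps 5-6: start every line as its own background process, then wait
# until all of them have finished (successfully or not).
while IFS= read -r task; do
  bash -c "$task" &
done < "$cmds_file"
wait

finished=$(ls "$workdir"/task*.log | wc -l | tr -d ' ')
echo "Tasks finished: $finished"
```

Because the four tasks run at the same time, the whole sketch takes about one second rather than four.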

...

Here are the main components of the SLURM batch system.


                          ls6, stampede3, ls5
batch system              SLURM
batch control file name   <job_name>.slurm
job submission command    sbatch <job_name>.slurm
job monitoring command    showq -u
job stop command          scancel -n <job name>

...

Let's go through a simple example. Execute the following commands to copy a premade simple.cmds commands file:

...

There are 8 tasks. Each task sleeps for 5 seconds, then uses the echo command to output a string containing the task number and date to a log file named for the task number. Notice that we can put two commands on one line if they are separated by a semicolon ( ; ).

You can count the number of lines in the simple.cmds file using the wc (word count) command with the -l (lines) option:

Code Block
languagebash
titleCount file lines
wc -l simple.cmds
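To get a feel for what such a commands file might look like, here is a minimal sketch that writes a stand-in file and counts its lines. The file name demo.cmds and the exact wording of each task line are assumptions based on the description above; the real simple.cmds may differ.

```shell
# Hypothetical stand-in for simple.cmds; the real file's wording may differ.
workdir=$(mktemp -d)
cmds_file="$workdir/demo.cmds"

# Write 8 lines, one task per line. The single-quoted printf format keeps the
# backticks literal, so date runs when a task executes, not when the file is written.
for i in 1 2 3 4 5 6 7 8; do
  printf 'sleep 5; echo "Command %s `date`" > cmd%s.log 2>&1\n' "$i" "$i"
done > "$cmds_file"

# Count the tasks and look at one of them.
num_tasks=$(wc -l < "$cmds_file" | tr -d ' ')
echo "Number of tasks: $num_tasks"
sed -n '3p' "$cmds_file"
```

Each line holds two commands separated by a semicolon, so the line count equals the number of tasks.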

Use the handy launcher_creator.py program to create the job control file.

Code Block
languagebash
titleCreate batch submission script for simple commands
launcher_creator.py -j simple.cmds -n simple -t 00:01:00 -a OTH21164 -q development

You should see output something like the following, and you should see a simple.slurm batch submission file in the current directory.

Code Block
Project simple.
Using job file simple.cmds.
Using development queue.
For 00:01:00 time.
Using OTH21164 allocation.
Not sending start/stop email.
Launcher successfully created. Type "sbatch simple.slurm" to queue your job.

...

Code Block
languagebash
titleSubmit simple job to batch queue
sbatch simple.slurm 
showq -u

# Output looks something like this:
-------------------------------------------------------------
          Welcome to the Lonestar6 Supercomputer
-------------------------------------------------------------
--> Verifying valid submit host (login2)...OK
--> Verifying valid jobname...OK
--> Verifying valid ssh keys...OK
--> Verifying access to desired queue (development)...OK
--> Checking available allocation (OTH21164)...OK
Submitted batch job 2411919

The queue status will initially show your job as WAITING until a node becomes available:

Code Block
SUMMARY OF JOBS FOR USER: <abattenh>

ACTIVE JOBS--------------------
JOBID     JOBNAME    USERNAME      STATE   NODES REMAINING STARTTIME
================================================================================

WAITING JOBS------------------------
JOBID     JOBNAME    USERNAME      STATE   NODES WCLIMIT   QUEUETIME
================================================================================
2411919   simple     abattenh      Waiting 1     0:01:00   Wed May 28 15:39:24

Total Jobs: 1     Active Jobs: 0     Idle Jobs: 1     Blocked Jobs: 0

Once your job is ACTIVE (running) you'll see something like this:

Code Block
SUMMARY OF JOBS FOR USER: <abattenh>

ACTIVE JOBS--------------------
JOBID     JOBNAME    USERNAME      STATE   NODES REMAINING STARTTIME
================================================================================
1722779   simple     abattenh      Running 1      0:00:39  Sat Jun  1 21:55:28

WAITING JOBS------------------------
JOBID     JOBNAME    USERNAME      STATE   NODES WCLIMIT   QUEUETIME
================================================================================

Total Jobs: 1     Active Jobs: 1     Idle Jobs: 0     Blocked Jobs: 0

...

Every job, no matter how few tasks it requests, will be assigned at least one node. Each Lonestar6 node has 128 physical cores, so each of the 8 tasks can be assigned to a different core.

...

The echo command is like a print statement in the bash shell.  echo takes its arguments and writes them to standard output. While not always required, it is a good idea to put echo's output string in double quotes ( " ).

backtick evaluation

So what is this funny looking `date` bit doing? Well, date is just another Linux command (try typing it in) that displays the current date and time. Here we don't want the shell to put the string "date" in the output; we want it to execute the date command and put the result text into the output. The backquotes ( ` ` ), also called backticks, around the date command tell the shell we want that command executed and its standard output substituted into the string. (Read more about Quoting in the shell)
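The difference is easy to see by trying both forms. The sketch below uses date +%Y (year only) purely to keep the output short, and captures each result in a variable (names are illustrative) before printing it.

```shell
# Without backticks the word "date" is just text.
plain=$(echo "Command 3 date")

# With backticks the shell runs date and substitutes its output.
with_backticks=$(echo "Command 3 `date +%Y`")

# $( ) is an equivalent, more modern way to write the same substitution.
with_dollar_paren=$(echo "Command 3 $(date +%Y)")

echo "$plain"
echo "$with_backticks"
echo "$with_dollar_paren"
```

The last two lines of output are identical; the $( ) form is often preferred because it is easier to read and to nest.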

...

The '2>&1' part says to redirect standard error to the same place. Technically, it says to redirect standard error (built-in Linux stream 2) to the same place as standard output (built-in Linux stream 1). Since standard output is going to cmd3.log, any standard error will go there also. (Read more about Standard streams and redirection)
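A small experiment makes the effect of 2>&1 visible. The function noisy_task and the log file names here are hypothetical, used only to produce one line on each stream.

```shell
workdir=$(mktemp -d)

# Hypothetical task that writes one line to each stream.
noisy_task() {
  echo "this goes to standard output"
  echo "this goes to standard error" >&2
}

# Send each stream to its own file.
noisy_task > "$workdir/out_only.log" 2> "$workdir/err_only.log"

# Send standard output to a file, then point standard error at the same place.
noisy_task > "$workdir/both.log" 2>&1

wc -l "$workdir/out_only.log" "$workdir/both.log"
```

The first log holds one line and the combined log holds two, because the error message followed standard output into the same file.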

When the TACC batch system runs a job, all outputs generated by tasks in the batch job are directed to one output file and one error file per job. Here they have names like simple.e2411919 and simple.o2411919.

  • simple.o2411919 contains all standard output generated by your tasks that was not redirected elsewhere
  • simple.e2411919 contains all standard error generated by your tasks that was not redirected elsewhere
  • both also contain information relating to running your job and its tasks

...

For large jobs with complex tasks, it is not easy to troubleshoot execution problems using these files.

...

The launcher module knows how to interpret various job parameters in the <job_name>.slurm SLURM batch submission script and use them to create your job and assign its tasks to compute nodes. Our launcher_creator.py program is a simple Python script that lets you specify job parameters and writes out a valid <job_name>.slurm submission script.

...

Code Block
languagebash
titleGet usage information for launcher_creator.py
# Use spacebar to page forward; Ctrl-c or q to exit
launcher_creator.py | more

...

Code Block
languagebash
titleCreate batch submission script for simple commands
launcher_creator.py -j simple.cmds -n simple -t 00:01:00 -a OTH21164 -q development
  • The name of your commands file is given with the -j simple.cmds option.
  • Your desired job name is given with the -n simple option.
    • The <job_name> (here simple) is the job name you will see in your queue.
    • By default a corresponding <job_name>.slurm batch file is created for you.
      • It contains the name of the commands file that the batch system will execute.

...

queue name    maximum runtime    purpose
development   2 hrs              development/testing and short jobs (typically has short queue wait times)
normal        48 hrs             normal jobs (queue waits are often quite long)

  • In launcher_creator.py, the queue is specified by the -q argument.
    • The default queue is development. Specify -q normal for normal queue jobs.
  • The maximum runtime you are requesting for your job is specified by the -t argument.
    • Format is hh:mm:ss
    • Note that your job will be terminated without warning at the end of its time limit!
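Since the run time is given as hh:mm:ss, it can help to convert a value to seconds when comparing it against how long your tasks take. The helper to_seconds below is a hypothetical convenience function, not part of launcher_creator.py.

```shell
# Hypothetical helper: convert an hh:mm:ss run time into seconds.
to_seconds() {
  echo "$1" | awk -F: '{ print $1 * 3600 + $2 * 60 + $3 }'
}

to_seconds 00:01:00   # the 1 minute requested for the simple job
to_seconds 02:00:00   # the development queue maximum (2 hrs)
to_seconds 48:00:00   # the normal queue maximum (48 hrs)
```

For the simple job, 8 tasks that each sleep for 5 seconds and run in parallel fit comfortably within the 60 seconds requested.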

...

  • You specify that allocation name with the -a argument of launcher_creator.py.
  • If you have set an $ALLOCATION environment variable to an allocation name, that allocation will be used.

Our class ALLOCATION was set in .bashrc

The .bashrc login script you've installed for this course specifies the class's allocation as shown below. Note that this allocation will expire after the course, so you should change that setting appropriately at some point.

Code Block
languagebash
titleALLOCATION setting in .bashrc
# This sets the default project allocation for launcher_creator.py
export ALLOCATION=OTH21164


  • When you run a batch job, your project allocation gets "charged" for the time your job runs, in the currency of SUs (Service Units).
  • SUs are based on node hours: usually 1 SU = 1 "standard" node hour.
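As a rough guide, the charge for a job can be estimated from its node count and run time. The sketch below assumes the simple rule of 1 SU per node hour; actual rates depend on the system and queue, and su_charge is a hypothetical helper name.

```shell
# Assumption: 1 SU = 1 node hour, as for a "standard" node.
su_charge() {
  # $1 = number of nodes, $2 = run time as hh:mm:ss
  echo "$1 $2" | awk '{
    split($2, t, ":")
    hours = t[1] + t[2] / 60 + t[3] / 3600
    printf "%.3f\n", $1 * hours
  }'
}

su_charge 1 00:01:00   # one node for one minute
su_charge 4 02:00:00   # four nodes for two hours
```

Under that assumption a one-minute, one-node test job costs a small fraction of an SU, while four nodes for two hours costs 8 SUs.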

...

Code Block
languagebash
titleCreate batch submission script for wayness example
launcher_creator.py -j wayness.cmds -n wayness -w 4 -t 00:02:00 -a OTH21164 -q development
sbatch wayness.slurm
showq -u

...

Code Block
languagebash
cat cmd*log

# or, for a listing ordered by command number (the 2nd space-separated field)
cat cmd*log | sort -k 2,2n

The vertical bar ( | ) above is the pipe operator, which connects one program's standard output to the next program's standard input.
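Here is a small self-contained sketch of why the pipe into sort matters. The log files and their contents are hypothetical stand-ins for the real cmd*.log files.

```shell
workdir=$(mktemp -d)

# Hypothetical log files standing in for the cmd*.log output.
for i in 10 2 1; do
  echo "Command $i finished" > "$workdir/cmd$i.log"
done

# The glob expands in alphabetical order, so cmd10.log comes before cmd2.log.
cat "$workdir"/cmd*log

# Piping into sort orders the lines numerically on field 2.
sorted=$(cat "$workdir"/cmd*log | sort -k 2,2n)
echo "$sorted"
```

The -k 2,2n key restricts the comparison to the second field and treats it as a number, so Command 10 lands after Command 2 instead of before it.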

...