Submitting Jobs on Lonestar5 and Stampede2

Overview

NB: For complete, up-to-date information, always see TACC's Lonestar5 User Guide (https://portal.tacc.utexas.edu/user-guides/lonestar5) and TACC's Stampede2 User Guide (https://portal.tacc.utexas.edu/user-guides/stampede)

The main point of using Lonestar5 or Stampede2 is that they are massive compute clusters. However, to use them properly, you need to know how to submit jobs to run on the cluster.

You normally access Lonestar5 or Stampede2 by logging into one of the "head" or "login" nodes at TACC.  If we run a command (in the normal Linux/Unix way) while logged in this way, we are running it on one of these low-memory, low-power "head" or "login" nodes. When we do serious computations that will take more than a few minutes or use a lot of RAM, we need to submit them to the cluster rather than run them on the head/login nodes.
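One quick way to tell which kind of node you are on is the hostname. The check below is a sketch: the "login" naming pattern and the example hostname are assumptions for illustration (TACC login node hostnames typically contain "login"), so verify the convention on your own system.

```shell
# Hypothetical hostname for illustration; in a real session you would
# instead use: node=$(hostname)
node="login1.ls5.tacc.utexas.edu"

# Login nodes are shared and low-powered: keep work light there.
case "$node" in
  *login*) echo "login node: submit heavy work with sbatch" ;;
  *)       echo "compute node: heavy computation is OK here" ;;
esac
```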

Creating a Launcher Script

To do so, you first need to create a launcher script.  TACC has supplied a sample launcher script which we can modify to queue and execute our job. Here's how to create a launcher script:

module load launcher
cp $LAUNCHER_DIR/extras/batch-scripts/launcher.slurm .
nano launcher.slurm

You can use any editor to edit the file; above we used the "nano" editor, but use whichever editor you are familiar with.  Typically, you will want to make some changes to this default launcher.  These changes are made with your text editor and involve changing or adding lines in the file that start with "#SBATCH ".

The "#SBATCH -N line (if added) specifies the name of the job. 

The "#SBATCH -o" line specifies the names of the output files that the job creates.  It might make sense to change the prefix to be the same as the name of this job.

The "#SBATCH -t" line specifies the length of time given to run the job. The more time we give our job, the longer in the queue our job will wait to be run. When the time is up, the job will terminate whether or not it is finished. So it's best to give our job slightly more time than you think it will take.

We can also, optionally, add a few lines to have TACC send us an email when the job starts and finishes.

To do that, we would add 2 new lines alongside the other #SBATCH directives, like so:

#SBATCH --mail-user=my_email@example.com
#SBATCH --mail-type=BEGIN,END

Also, if we are part of multiple allocations, we'll need to specify which allocation to use (NB: the allocation name is case-sensitive).

#SBATCH -A UT-2015-05-18

Lastly, we need to specify the job file, the file listing the commands to run.  Here we assume it is called "commands".  Since launcher.slurm is a bash script, this is set with "export":

export LAUNCHER_JOB_FILE=commands
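The job file itself is just a plain text file with one command per line; each line becomes one task run on the cluster. A hypothetical "commands" file might look like the following (the program and file names are made up for illustration):

```shell
# Each line is one independent task; the launcher runs them in parallel.
./my_program input_01.txt > output_01.txt
./my_program input_02.txt > output_02.txt
./my_program input_03.txt > output_03.txt
```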


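Putting the pieces together, a completed launcher.slurm might look like the sketch below. The job name, node/task counts, queue name, time limit, and email address are placeholder assumptions, and the surrounding boilerplate in TACC's sample file may differ from what is shown here:

```shell
#!/bin/bash
#SBATCH -J my_job                         # job name (placeholder)
#SBATCH -o my_job.o%j                     # output file (%j expands to the job ID)
#SBATCH -N 1                              # number of nodes (placeholder)
#SBATCH -n 4                              # total number of tasks (placeholder)
#SBATCH -p normal                         # queue (partition) name (placeholder)
#SBATCH -t 01:30:00                       # run time limit (hh:mm:ss)
#SBATCH -A UT-2015-05-18                  # allocation, case-sensitive
#SBATCH --mail-user=my_email@example.com  # where to send notifications
#SBATCH --mail-type=BEGIN,END             # email when the job starts and ends

module load launcher
export LAUNCHER_JOB_FILE=commands         # the job file described above

# TACC's sample script ends by invoking the launcher's paramrun, which
# executes each line of the job file as a task.
$LAUNCHER_DIR/paramrun
```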
Submitting the Job

The next step is to submit the job to the queue using the launcher file you created.

sbatch launcher.slurm

The system will check that everything specified in the launcher file is valid, and if it is, the job will be queued.

To check the status of the job, the command is:

showq -u <username>

This will tell you its job priority and what state it is in.

If the job is in the list of "waiting jobs", this means the job has been queued and is waiting to start.

If the job is in the list of "active jobs", this means the job is running.


The relevant fields in the "showq" output are:

JOBID    job id assigned to the job
USER     user that owns the job
STATE    current job status, including, but not limited to:
           CD  (completed)
           CA  (cancelled)
           CF  (configuring)
           F   (failed)
           PD  (pending)
           R   (running)


If we notice something wrong with the job, we can cancel it like so:

scancel job-ID

To obtain the job-ID, look at the "showq" output.

TACC Output Files

While your job is running, TACC creates 3 different files with names based on the "#SBATCH -o" setting in the launcher. These files are named like so:

(job_name).e(job-ID)
(job_name).pe(job-ID)
(job_name).o(job-ID)

These files contain the output of your job that would have been sent to standard output or standard error, along with messages from TACC about your job. They are useful for determining what happened if your job fails.