Using TACC's Job Submission System (and 2017 end of class review)


Introduction:

Throughout the course we have had you run anything of substance (i.e., programs and scripts) on iDev nodes. This was possible in large part thanks to the reservation system, which allowed you to access an iDev node without having to wait. In previous years, tutorials had to be planned around:

  • "hurry up and get the job started its going to sit for some amount of time in the que"
  • "what can we tell them while they wait for their job to run" 
  • "DRAT! there is a typo in their command's file they have to edit and go back to the end of the que". 

Together, this has enabled each of you to accomplish more tutorials than in previous years while hopefully learning more.

Objectives:

  1. Familiarize yourself with TACC's job submission system.
  2. Tidy up some other loose ends from the course.

Running jobs on TACC

Understanding "jobs" and compute nodes.

When you log into lonestar using ssh, you are connected to what is known as a login node or "head node". There are several different head nodes, but they are shared by everyone logged into lonestar (not just this class, or campus, or even Texas, but everywhere in the world). Anything you type on the command line is executed by the head node, and the longer it takes to complete, the more it slows down you and everybody else. Get enough people running large jobs on the head node all at once (say, a classroom full of Big Data in Biology summer school students) and lonestar can actually crash, leaving nobody able to execute commands or even log in for minutes, hours, or even days if something goes really wrong.

To try to avoid crashes, TACC monitors the head nodes and proactively stops processes before they get too out of hand. If you guess wrong about whether something should be run on the head node, you may eventually see a message like the one pasted below. If you do, it's not the end of the world, but repeated messages can lead to revoked TACC access and emails in which you have to explain to TACC and your PI what you were doing and how you will avoid it in the future.

Example of how you learn you shouldn't have been on the head node
Message from root@login1.ls4.tacc.utexas.edu on pts/127 at 09:16 ...
Please do not run scripts or programs that require more than a few minutes of
CPU time on the login nodes.  Your current running process below has been
killed and must be submitted to the queues, for usage policy see
http://www.tacc.utexas.edu/user-services/usage-policies/
If you have any questions regarding this, please submit a consulting ticket.
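A reasonable rule of thumb: quick navigation and file inspection are fine on the head node, while anything compute- or memory-intensive must go through the queue (or an iDev node). The file and program names below are hypothetical examples, not commands from a specific tutorial:

Fine on the head node: fast, lightweight commands
ls my_project
head -4 sample1.fastq
wc -l sample1.fastq
nano commands

Not fine on the head node: real computation, however short the command looks
bwa mem reference.fa sample1.fastq > sample1.sam
spades.py -1 sample1_R1.fastq -2 sample1_R2.fastq -o assembly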

So you may be asking yourself what the point of using lonestar is at all if it is fraught with so many issues. The answer comes in the form of compute nodes. There are 1,252 compute nodes, each of which can only be accessed by a single person for a specified amount of time. These compute nodes are divided into different queues: normal, development, largemem, etc. Access to nodes (regardless of what queue they are in) is controlled by a "Queue Manager" program. You can personify the Queue Manager as Heimdall in Thor, a more polite version of Gandalf dealing with the Balrog in The Lord of the Rings, the troll from the Billy Goats Gruff tale, or any other "gatekeeper" type.

Regardless of how nerdy your personification choice is, the Queue Manager has an interesting caveat: you can only interact with it using the sbatch command. "sbatch <filename.slurm>" tells the Queue Manager to run a job based on the information in filename.slurm (i.e., how many nodes you need, how long you need them, what allocation to charge, etc.). The Queue Manager doesn't care WHAT you are running, only HOW to find what you are running (which is specified by a "setenv CONTROL_FILE commands" line in your filename.slurm file). The WHAT is then handled by the file "commands", which contains what you would normally type on the command line to make things happen. A sketch of both files is shown below.
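To make the division of labor concrete, here is a minimal sketch of the two files. The job name, allocation name, and gzip commands are hypothetical, and the exact launcher setup (the module load and the paramrun line) differs between TACC systems and launcher versions, so treat this as a template rather than a copy-and-paste script.

Example filename.slurm (a hedged sketch; the HOW)
#!/bin/csh
#SBATCH -J gzip_demo             # job name, as it will appear in the queue
#SBATCH -N 1                     # number of nodes requested
#SBATCH -n 4                     # total number of tasks to run
#SBATCH -p development           # which queue to wait in
#SBATCH -t 00:30:00              # maximum run time (hh:mm:ss)
#SBATCH -A My-Allocation         # allocation to charge (hypothetical name)
#SBATCH -o gzip_demo.o%j         # output file; %j is replaced by the job ID

module load launcher             # TACC's parametric job launcher
setenv CONTROL_FILE commands     # point the launcher at the WHAT
$TACC_LAUNCHER_DIR/paramrun      # run every line of the commands file

Example commands file (the WHAT): one ordinary command-line command per line
gzip -c sample1.fastq > sample1.fastq.gz
gzip -c sample2.fastq > sample2.fastq.gz
gzip -c sample3.fastq > sample3.fastq.gz
gzip -c sample4.fastq > sample4.fastq.gz

You would then submit the job with "sbatch filename.slurm", check on it with the standard SLURM command "squeue -u <your_username>" (TACC's "showq -u" works too), and, if you spot a typo after submitting, cancel it with "scancel <job_id>".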

Further sbatch reading

The following table lists the options available in the sbatch command file.