Table of Contents
...
| POD name | Description | BRCF delegates | Compute servers | Storage server | Unix Groups |
|---|---|---|---|---|---|
| AMD GPU POD | POD with GPU resources available for instructional and qualifying research use. Note: this POD uses UT EID authentication. | Anna Battenhouse | | amdbstor01.ccbb.utexas.edu | Per course and research project |
| BIC POD | POD for the Biomedical Imaging Core facility in the CBRS | Cici Cumba | bicfcomp01.ccbb.utexas.edu (6 / 72 / 42) | bicfstor01.ccbb.utexas.edu | BIC |
| CBRS POD | Shared POD for CBRS core facilities | Anna Battenhouse | | cbrsstor01.ccbb.utexas.edu | BCG, CBRS_BIC, CBRS_CryoEM, CBRS_microscopy, CBRS_org, CBRS_proteomics |
| Chen/Wallingford/Raccah POD | Shared POD for members of the Jeffrey Chen, John Wallingford, and Doran Raccah labs | | | chenstor01.ccbb.utexas.edu | Chen, Raccah, Wallingford |
| Dickinson/Cambronne POD | Shared POD for members of the Dan Dickinson and Lulu Cambronne labs | | | djdistor01.ccbb.utexas.edu | Dickinson, Cambronne |
| Educational (EDU) POD | Dedicated instructional POD. Note: this POD uses UT EID authentication. | Course instructors | | educstor01.ccbb.utexas.edu | Per course. See The Educational PODs |
| Georgiou/WCAAR POD | Shared POD for members of the Georgiou lab and the Waggoner Center for Alcoholism & Addiction Research (WCAAR) | | | georstor01.ccbb.utexas.edu | Georgiou, WCAAR |
| GSAF POD | Shared POD for use by GSAF customers. A 2 TB Work area allocation is available for participating groups; contact Anna Battenhouse for more information. | | | gsafstor01.ccbb.utexas.edu | GSAF customer groups; GSAF internal & instructional groups |
| Hopefog (Ellington) POD | Shared POD for Ellington & Marcotte lab special projects | | | hfogstor01.ccbb.utexas.edu | Ellington, Marcotte, Wilke |
| Iyer/Kim POD | Shared POD for members of the Vishy Iyer and Jonghwan Kim labs | | | iyerstor01.ccbb.utexas.edu | Iyer, JKim |
| Kirkpatrick POD | Shared POD for members of the Kirkpatrick and Harpak labs | TBD | | kirkstor01.ccbb.utexas.edu | Kirkpatrick, Harpak |
| Lambowitz/CCBB POD | Shared POD for use by CCBB affiliates and the Alan Lambowitz lab | | | lambstor01.ccbb.utexas.edu | Lambowitz groups; CCBB groups; instructional groups |
| LiveStrong DT POD | POD for members of Dell Medical School's LiveStrong Diagnostic Therapeutics group. Note: this POD uses UT EID authentication. | | | livestor01.ccbb.utexas.edu | Jeanne Kowalski groups; Lauren Ehrlich groups; other groups; instructional groups |
| Marcotte POD | Single-lab POD for members of the Edward Marcotte lab | | | marcstor02.ccbb.utexas.edu | Marcotte |
| Ochman/Moran POD | Shared POD for members of the Howard Ochman and Nancy Moran labs | | | ochmstor01.ccbb.utexas.edu | Ochman, Moran |
| Rental POD | Shared POD for POD rental customers | | | rentstor01.ccbb.utexas.edu | Brock, Calder, Champagne, Curley, Fleet, Fonken, Gore, Gross, Hillis, Lopez, Nguyen, Raccah, Seidlits, Sullivan, YiLu, Zamudio |
| Wilke POD | For use by members of the Claus Wilke lab and the AG3C collaboration | | | wilkstor01.ccbb.utexas.edu | Wilke |
...
Shared Work areas are backed up weekly. Scratch areas are not backed up. Both Work and Scratch areas may have quotas, depending on the POD (e.g. on the Rental or GSAF pod); such quotas are generally in the multi-terabyte range.
Because it has a large quota and is regularly backed up and archived, your group's Work area is where large research artifacts that need to be preserved should be located.
Scratch, on the other hand, can be used for artifacts that are transient or can easily be re-created (such as downloads from public databases).
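For example, before staging a large dataset you can check how much space is free in each area. This is a minimal sketch assuming the usual /stor/work and /stor/scratch layout, with MyGroup standing in for your group's directory name:

```
# show free space (human-readable) for a group's Work and Scratch areas
df -h /stor/work/MyGroup /stor/scratch/MyGroup
```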
...
Note that any directory in any file system tree named tmp, temp, or backups is not backed up. Directories with these names are intended for temporary files, especially large numbers of small temporary files. See "Cannot create tempfile" error and Avoid having too many small files.
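For example, a pipeline that generates throw-away intermediate files can write them under a tmp sub-directory so backups skip them (the path here is hypothetical):

```
# create a tmp sub-directory that the backup process will ignore
mkdir -p /stor/work/MyGroup/my_project/tmp
```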
...
What is too many? Ten million or more.
If the files are small, they don't take up much storage space. But the fact that there are so many causes the backup or archiving to run for a really long time. For weekly backups, this can mean that the previous week's backup is not done by the time the next one starts. For archiving, it means it can take weeks on end to archive a single directory that has many millions of small files.
Backing up gets even worse when a directory with many files is just moved or renamed. In this case the files need to be deleted from the old location and added to the new one – and both of these operations can be extremely long-running.
To see how many files there are under a directory tree (each file consumes an "inode" in Unix), use the df -i command, which reports inode usage for the file system that contains the directory. For example:
```
df -i /stor/work/MyGroup/my_dir
```
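If you want a count for one specific directory tree rather than for the file system as a whole, a generic find-based count also works, though it can itself take a long time when millions of files are present (the path below is a placeholder):

```
# count regular files under a directory tree
find /stor/work/MyGroup/my_dir -type f | wc -l
```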
...
1) Move the files to a temporary directory.
The backup process excludes any sub-directory named tmp, temp, or backups anywhere in the file system directory tree. So if the files are ones you don't care about, simply rename their directory to, for example, tmp. There will be a one-time deletion of the directory under its previous name in the next backup, but that is it.
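A minimal sketch, assuming the unneeded files live in a hypothetical intermediate_files directory under your group's Work area:

```
cd /stor/work/MyGroup/my_project
# rename the directory so subsequent backups skip its contents
mv intermediate_files tmp
```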
...
3) Zip or tar the directory
If these are important files you need to have backed up, zipping or tarring the directory is the way to go. This converts a directory and all its contents into a single, larger file that can be backed up or archived efficiently. Please Contact Us if you would like help with this: with our direct access to the storage server, we can perform zip and tar operations much more efficiently than you can from a compute server.
If your analysis pipeline creates many small files as a matter of course, consider modifying the processing to create them in a tmp directory, then zipping or tarring that directory as a final step.
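As a rough sketch of that final step (directory and archive names are hypothetical):

```
# bundle the directory of small files into a single compressed archive
tar czf my_results.tar.gz my_results/
# after verifying the archive, the original directory can be removed
# rm -rf my_results/
```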
Memory usage considerations
...
And in a pathological (but unfortunately not uncommon) pattern, a program (or programs) that needs more memory than is available can cause "thrashing", where data is continuously swapped in and out of RAM. This will bring a computer to its knees, making it virtually impossible to do anything on it (slow logins, or logins timing out; any simple command just "hanging" for a long time or never returning). Note that this situation can arise when each process a user starts itself spawns multiple threads, as described at Do not run too many processes. We monitor system usage and will intervene when we see this happen, by terminating the offending process(es) if possible, or by rebooting the compute server if not.
...
Many programs offer an option to divide their work among multiple processes, which can reduce the total clock time the program runs. The option may refer to "processes", "cores" or "threads", but all of these actually target the available computing units on a server. Examples include the samtools sort --threads option; the bowtie2 -p/--threads option; in R, library(doParallel) with registerDoParallel(cores = NN); and the OMP_NUM_THREADS environment variable for OpenMP programs.
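For instance (file and index names below are hypothetical, and 4 is just an illustrative count):

```
# sort a BAM file using 4 threads
samtools sort --threads 4 -o sample.sorted.bam sample.bam

# align reads with bowtie2 using 4 threads
bowtie2 -p 4 -x genome_index -U sample.fastq.gz -S sample.sam

# limit OpenMP-based programs started from this shell to 4 threads
export OMP_NUM_THREADS=4
```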
A "computing unit" is a server's cores and hyperthreads, and it is important to keep in mind the difference between the two. Cores are physical computing units, while hyperthreads are virtual computing units – kernel objects that "split" each core into two hyperthreads so that the single compute unit can be used by two processes.
The POD Resources and Access: AvailablePODs table describes the compute servers that are associated with each BRCF pod, along with their available cores and (hyper)threads. (Note that most servers are dual-CPU, meaning that total core count is double the per-CPU core count, so a dual 4-core CPU machine would have 8 cores.) You can also see the hyperthread and core counts on any server via:
```
cat /proc/cpuinfo | grep -c 'core id'           # the number of hyperthreads (logical CPUs)
cat /proc/cpuinfo | grep 'siblings' | head -1   # hyperthreads per CPU socket; on a dual-socket
                                                #   hyperthreaded server this equals the physical core count
```
(Yes, it is confusing that 'core id' gives the hyperthread count while 'siblings' relates to the physical cores. But what do you expect -- this is Unix.)
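If you prefer to skip the /proc/cpuinfo guesswork, the lscpu utility (part of the standard util-linux package on most Linux distributions) reports the same information less ambiguously:

```
# show threads per core, cores per socket, and socket count
lscpu | egrep 'Thread|Core|Socket'
```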
Since hyperthreads look like available computing units ("CPUs" in OS displays), parallel processing options that detect "cores" usually really detect hyperthreads. Why does this matter?
...
So before you select a process/core/thread count for your program, consider whether it will perform significant I/O. If so, you can specify a higher count. If it is compute bound (e.g. machine learning), be sure to specify a count low enough to leave free hyperthreads for others to use.
The fact that machine learning (ML) workflows are so heavily compute bound is the main reason ML processing is best run on GPU-enabled servers, either at TACC or on one of the BRCF pods with GPUs (see BRCF GPU servers).
Do not run too many processes
Having described how to run multiple processes, we must stress that you should not run too many at a time: you are using just one compute server, and you're not the only one using the machine!
Note that starting a single instance of a program can sometimes spawn many threads. For example, each instance of an OpenMP program will, by default, use all available threads. To avoid this with OpenMP, set the OMP_NUM_THREADS environment variable (e.g. export OMP_NUM_THREADS=1). In any case, it is important to check the documentation for the particular program being used, and to use top (press the 1 key to see per-hyperthread load) or htop to see how many threads a single instance of the program uses.
How many is "too many"? That really depends on what kind of job it is, what compute/input-output mix it has, and how much RAM it needs. As a general rule, don't run more simultaneous jobs on a POD compute server than you would run on a single TACC compute node.
Before running multiple jobs, you should check RAM usage (free -g will show usage in GB) and see what is already running, using the top program (press the 1 key to see per-hyperthread load), the who command, or a command like this:
```
ps -ef | grep -v root | grep -v bash | grep -v sshd | grep -v screen | grep -v tmux | grep -v 'www-data'
```
Here is a good article on all the aspects of the top command: https://www.booleanworld.com/guide-linux-top-command/
Finally, be sure to lower the priority of your processes using renice as described below (e.g. renice -n 15 -u `whoami`).
Lower priority for large, long-running jobs
If you have one or more jobs that use multiple threads, or that perform significant I/O, their execution can affect system responsiveness for other users.
To help avoid this, please use the renice tool to manipulate the priority of your tasks (a priority of 15 is a good choice). It's easy to do, and here's a quick tutorial: http://www.thegeekstuff.com/2013/08/nice-renice-command-examples/?utm_source=tuicool
For example, before you start any tasks, you can set the default priority to nice 15 as shown here. Anything you start from then on (from this shell) should inherit the nice 15 value.
```
renice +15 $$
```
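If the job is already running, you can renice it after the fact (the PID below is hypothetical):

```
# lower the priority of one running process by its PID
renice -n 15 -p 12345

# or lower the priority of all of your own processes at once
renice -n 15 -u $(whoami)
```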
...
Running processes unattended
While POD compute servers do not have a batch system, you can still run multiple tasks simultaneously in several different ways.
For example, you can use terminal multiplexer tools like screen or tmux to create virtual terminal sessions that won't go away when you log off. Then, inside a screen or tmux session you can create multiple sub-shells where you can run different commands.
You can also use the command line utility nohup to start processes in the background, again allowing you to log off and still have the process running.
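As a quick sketch (the session and script names are hypothetical):

```
# start a named tmux session; run your commands inside it, then detach with Ctrl-b d
tmux new -s mywork

# or run a long job in the background with nohup so it keeps running after you log off
nohup ./my_long_pipeline.sh > my_long_pipeline.log 2>&1 &
```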
Here are some links on how to use these tools:
...