Table of Contents

...

Tip

Anyone with access to a POD may use any of the available compute servers, regardless of the server names. For example, both Georgiou and WCAAR users can access wcarcomp01 and wcarcomp02, and both Lambowitz and CCBB users can access lambcomp02 and ccbbcomp02.


POD name | Description | BRCF delegates | Compute servers | Storage server | Unix Groups
AMD GPU pod

Pod with GPU resources available for instructional and qualifying research use

Note: This pod uses UT EID authentication


Anna Battenhouse
  • amdgcomp01.ccbb.utexas.edu, amdgcomp02.ccbb.utexas.edu, amdgcomp03.ccbb.utexas.edu
    • Dual 64-core EPYC 7V13 CPUs
    • 512 GB RAM
    • 8 AMD Radeon Instinct MI-100 GPUs w/32GB onboard RAM each

amdgstor01.ccbb.utexas.edu

  • 12 6-TB disks
  • 72 TB raw, 42 TB usable

Per course and research project. See

BIC pod

Pod for the Biomedical Imaging Core facility in the CBRS

Cici Cumba

bicfcomp01.ccbb.utexas.edu

  • Dell PowerEdge R660xs
  • dual 28-core/56-thread CPUs
  • 768 GB RAM
  • 1.9 TB SATA SSD for ultra-high-speed local I/O, mounted as /ssd1 (not backed up)

bicfstor01.ccbb.utexas.edu

  • 12 18-TB disks
  • 216 TB raw, 128 TB usable
BIC
CBRS pod

Shared pod for CBRS core facilities

Anna Battenhouse
  • cbrscomp01.ccbb.utexas.edu,
    cbrscomp02.ccbb.utexas.edu
    • Dell PowerEdge R640
    • dual 26-core/52-thread CPUs
    • 768 GB RAM
    • 960 GB SATA SSD for ultra-high-speed local I/O, mounted as /ssd1 (not backed up)

cbrsstor01.ccbb.utexas.edu

  • 30 16-TB disks
  • 480 TB raw, 285 TB usable
BCG, CBRS_BIC, CBRS_CryoEM, CBRS_microscopy, CBRS_org, CBRS_proteomics
Chen/Wallingford/Raccah pod

Shared pod for members of the Jeffrey Chen, John Wallingford and Doran Raccah labs


  • chencomp01.ccbb.utexas.edu
    • Dell PowerEdge R410
    • dual 4-core/8-thread CPUs
    • 64 GB RAM
  • chencomp02
    • Dell AMD node
    • dual 64-core/128-thread AMD EPYC CPUs
    • 768 GB RAM
    • 1.9 TB NVMe for ultra-high-speed local I/O, mounted as /ssd1 (not backed up)
  • chencomp03.ccbb.utexas.edu

chenstor01.ccbb.utexas.edu

  • 24 8-TB disks
  • 192 TB raw, 106 TB usable


Chen, Raccah, Wallingford
Dickinson/Cambronne pod

Shared pod for members of the Dan Dickinson and Lulu Cambronne labs
  • Dan Dickinson
  • Lulu Cambronne
  • djdicomp01.ccbb.utexas.edu
    • Dell PowerEdge R410
    • dual 4-core/8-thread CPUs
    • 64 GB RAM

djdistor01.ccbb.utexas.edu

  • 24 8-TB disks
  • 192 TB raw, 106 TB usable


Dickinson, Cambronne
Educational (EDU) pod

Dedicated instructional pod

Note: This pod uses UT EID authentication

Course instructors.

See The Educational PODs

  • edupod.cns.utexas.edu
    • virtual host for pool of 3 physical servers listed below
  • educcomp01.ccbb.utexas.edu
  • educcomp02.ccbb.utexas.edu
  • educcomp04.ccbb.utexas.edu
    • Dell PowerEdge R640
    • dual 28-core/52-thread CPUs
    • 1 TB RAM

educstor01.ccbb.utexas.edu

  • 12 8-TB disks
  • 96 TB raw, 53 TB usable


Per course. See The Educational PODs
Georgiou/WCAAR pod

Shared pod for members of the Georgiou lab and the Waggoner Center for Alcoholism & Addiction Research (WCAAR)

  • Russ Durrett (Georgiou lab)
  • Dayne Mayfield (WCAAR)
    • wcarcomp01.ccbb.utexas.edu
      • Dell PowerEdge R430
      • dual 16-core/32-thread CPUs
      • 256 GB RAM
    • wcarcomp02.ccbb.utexas.edu
      • Dell PowerEdge R430
      • dual 18-core/36-thread CPUs
      • 384 GB RAM
    • wcarcomp03.ccbb.utexas.edu
      • Dell PowerEdge R640
      • dual 26-core/52-thread CPUs
      • 1 TB RAM
      • 1.8 TB SATA SSD for ultra-high-speed local I/O, mounted as /ssd1 (not backed up)

    georstor01.ccbb.utexas.edu

    • 30 16-TB disks
    • 480 TB raw, 285 TB usable


    Georgiou, WCAAR, FRI-BigDataBio

    GSAF pod


    Shared pod for use by GSAF customers. 2TB Work area allocation available for participating groups.

    Contact Anna Battenhouse for more information.

    • Anna Battenhouse
    • Dhivya Arasappan
  • gsafcomp01.ccbb.utexas.edu
    • gsafcomp02.ccbb.utexas.edu
      • Dell PowerEdge R410
      • dual 4-core/8-thread CPUs
      • 64 GB RAM
    • gsafcbig01.ccbb.utexas.edu
      • Dell PowerEdge R720
      • dual  6-core/12-thread CPUs
      • 192 GB RAM

    gsafstor01.ccbb.utexas.edu

    • 18 8-TB disks
    • 144 TB raw, 95 TB usable

    GSAF customer groups:
    Alper, Atkinson, Baker, Barrick, Bolnick, Bray,  Browning, Cannatella, Contrearas, Crews, Drew, Dudley, Eberhart, Ellington, GSAFGuest, Hawkes, HoWinson, HyunJunKim, Kirisits, Leahy, Leibold, LiuHw, Lloyd, Manning, Matz, Mueller, Paull, Press, SSung, ZhangYJ

    GSAF internal & instructional groups:
    GSAF, BioComputing2017, CCBB_Workshops_1, FRI-BigDataBio

    Hopefog (Ellington) pod

    Shared pod for Ellington & Marcotte lab special projects
    • Anna Battenhouse
    • hfogcomp01.ccbb.utexas.edu
      • Dell PowerEdge R730xd
      • dual 10-core/20-thread CPUs
      • 250 GB RAM
      • 37 TB local RAID storage,  mounted as /raid (not backed up)
    • hfogcomp02.ccbb.utexas.edu,
      hfogcomp03.ccbb.utexas.edu
      • AMD GPU servers
      • 48-core/96-hyperthread EPYC CPU
      • 512 GB RAM
      • 8 AMD Radeon Instinct MI-50 GPUs w/32GB onboard RAM each
      • use /tmp (512 GB on NVMe) for fast local I/O
    • hfogcomp04.ccbb.utexas.edu
      • Dell PowerEdge R750XA
      • dual 24-core/48-thread CPUs
      • 512 GB RAM
      • 2 NVIDIA Ampere A100 GPUs w/80GB onboard RAM each
      • 1.8 TB NVMe mounted as /NVMe1 for fast local I/O (not backed up)
    • hfogcomp05.ccbb.utexas.edu
      • GIGABYTE MC62-G40-00
      • 32-core/64-thread AMD Ryzen CPU
      • 512 GB RAM
      • 4 NVIDIA RTX 6000 Ada GPUs, 48G RAM each
      • 1.8 TB NVMe mounted as /NVMe1 for fast local I/O (not backed up)

    hfogstor01.ccbb.utexas.edu

    • 24 8-TB disks
    • 194 TB raw, 110 TB usable
    Ellington, Marcotte, Wilke
    Iyer/Kim/Young pod

    Shared pod for members of the Vishy Iyer and Jonghwan Kim labs
    • Anna Battenhouse
    • Rebecca Young
    • iyercomp02.ccbb.utexas.edu (aka dragonfly.icmb.utexas.edu)
      • Dell PowerEdge R410
      • dual 4-core/8-thread CPUs
      • 64GB RAM
    • iyercomp03.ccbb.utexas.edu (aka adler3.icmb.utexas.edu)
      • Dell PowerEdge R720
      • dual  6-core/12-thread CPUs
      • 192 GB RAM
    • iyercomp04.ccbb.utexas.edu
      • Dell PowerEdge R660xs
      • dual 28-core/56-thread CPUs
      • 756 GB RAM
      • 1.9 TB SSD for high-speed local I/O, mounted as /ssd1 (not backed up)

    iyerstor01.ccbb.utexas.edu

    • 18 18-TB disks
    • 324 TB raw, 190 TB usable


    Iyer, JKim, Young
    Kirkpatrick pod

    Shared pod for members of the Kirkpatrick and Harpak labs
    TBD

    • kirkcomp01.ccbb.utexas.edu
      • Dell PowerEdge R640
      • dual 26-core/52-thread CPUs
      • 768 GB RAM
      • 1.9 TB SSD for high-speed local I/O, mounted as /ssd1 (not backed up)

    kirkstor01.ccbb.utexas.edu

    • 12 18-TB disks
    • 216 TB raw, 124 TB usable
    Kirkpatrick, Harpak
    Lambowitz/CCBB pod

    Shared pod for use by CCBB affiliates and the Alan Lambowitz lab.


    • Hans Hofmann (Hofmann lab & CCBB affiliates)
    • Jun Yao (Lambowitz lab)
    • lambcomp02.ccbb.utexas.edu
      •  Dell PowerEdge R660xs
      • dual 28-core/56-thread CPUs
      • 512 GB RAM
    • ccbbcomp02.ccbb.utexas.edu
      • Dell PowerEdge R720
      • dual  6-core/12-thread CPUs
      • 192 GB RAM

    lambstor01.ccbb.utexas.edu

    • 24 16-TB disks
    • 384 TB raw, 225 TB usable


    Lambowitz groups:
    Lambowitz, LambGuest

    CCBB groups:
    Cannatella, Hawkes, Hillis, Hofmann, Jansen

    Instructional groups:
    FRI-BigDataBio


    LiveStrong DT pod

    Pod for members of Dell Medical School's LiveStrong Diagnostic Therapeutics group.

    Note: This POD uses UT EID authentication

    • Jeanne Kowalski
    • livecomp01.ccbb.utexas.edu
      • Dell PowerEdge R440
      • dual 14-core/28-thread CPUs
      • 192 GB RAM
      • 480 GB SATA SSD for ultra-high-speed local I/O, mounted as /ssd1 (not backed up)
    • livecomp02.ccbb.utexas.edu, livecomp03.ccbb.utexas.edu
      • AMD GPU server
      • 48-core/96-hyperthread EPYC CPU
      • 512 GB RAM
      • 8 AMD Radeon Instinct MI-50 GPUs with 32GB onboard RAM each
    • livecomp04.ccbb.utexas.edu
      • Dell PowerEdge R640
      • dual 26-core/52-hyperthread CPUs
      • 768 GB RAM
      • 1.9 TB SSD for high-speed local I/O, mounted as /ssd1 (not backed up)

    livestor01.ccbb.utexas.edu

    • 24 20-TB disks
    • 480 TB raw, 280 TB usable

    Jeanne Kowalski groups:
    CancerClinicalGenomics, ColoradoData, MultipleMyeloma

    Lauren Ehrlich groups:
    Ehrlich_COVID19, Ehrlich

    Other groups:
    Kim, Matsui, Melamed_COVID

    Instructional groups:
    FRI-BigDataBio

    Marcotte/Gilpin pod

    Shared pod for members of the Edward Marcotte and William Gilpin labs
    • Anna Battenhouse
    • marccomp01.ccbb.utexas.edu (aka hopper.icmb.utexas.edu)
      • Dell PowerEdge R730
      • dual 18-core/36-thread CPUs
      • 768 GB RAM
    • marccomp02.ccbb.utexas.edu (aka ada.icmb.utexas.edu)
      • Dell PowerEdge R610
      • dual 4-core/8-thread CPUs
      • 96 GB RAM
    • gilpcomp01.ccbb.utexas.edu (aka perutz.ccbb.utexas.edu)
      • Dell PowerEdge R610
      • dual 4-core/8-thread CPUs
      • 96 GB RAM

    • ThinkMate GPU server
      • dual 96-core CPUs
      • 768 GB RAM
      • 4 GH100 NVIDIA GPUs with 96G onboard RAM each
      • 4 13TB NVMe drives for high-speed local I/O (not backed up), mounted as /ssd1, /ssd2, /ssd3, /ssd4

    marcstor02.ccbb.utexas.edu (a.k.a. marcstor01.ccbb.utexas.edu)

    • 30 16-TB disks
    • 480 TB raw, 285 TB usable

    Marcotte, Gilpin
    Ochman/Moran pod

    Shared pod for members of the Howard Ochman and Nancy Moran labs
    • Howard Ochman
    • ochmcomp01.ccbb.utexas.edu
      • Dell PowerEdge R430
      • dual 18-core/36-thread CPUs
      • 384 GB RAM
    • ochmcomp02.ccbb.utexas.edu
      • Dell PowerEdge R640
      • dual 26-core/52-hyperthread CPUs
      • 1024 GB RAM
      • 1.9 TB SSD for high-speed local I/O, mounted as /ssd1 (not backed up)

    ochmstor01.ccbb.utexas.edu

    • 18 18-TB disks
    • 324 TB raw, 190 TB usable


    Ochman, Moran
    Rental pod

    Shared pod for pod rental customers
    • Anna Battenhouse (overall)
    • Daylin Morgan (Brock)
    • rentcomp01.ccbb.utexas.edu
      • Dell PowerEdge R640
      • dual 18-core/36-thread CPUs
      • 768 GB RAM
      • 900 GB SSD for fast local I/O, mounted as /ssd1 (not backed up)
    • rentcomp02.ccbb.utexas.edu
      • Dell PowerEdge R640
      • dual 18-core/36-thread CPUs
      • 256 GB RAM
      • 450 GB SSD for fast local I/O, mounted as /ssd1 (not backed up)
    • rentcomp03.ccbb.utexas.edu
      • Dell PowerEdge R660xs
      • dual 28-core/56-thread CPUs
      • 756 GB RAM
      • 1.8 TB SSD for fast local I/O, mounted as /ssd1 (not backed up)

      rentstor01.ccbb.utexas.edu

      • 24 16-TB disks
      • 288 TB raw, 170 TB usable
      Brock, Calder, Champagne, Curley, Fleet, Fonken, Gore, Gray, Gross, Hillis, Jara, Matouschek, Nguyen, Raccah, Sedio, Seidlits, Sullivan, YiLu, Zamudio, ZhangYJ
      Wilke pod

      For use by members of the Claus Wilke lab
      • Aaron Feller
      • Alexis Hill
      • wilkcomp01.ccbb.utexas.edu
      • wilkcomp02.ccbb.utexas.edu
        • Dell PowerEdge R930
        • quad 14-core/28-thread CPUs
        • 1 TB RAM
      • wilkcomp03.ccbb.utexas.edu
        • GIGABYTE MC62-G40
        • 48-core AMD Ryzen 5975 CPU
        • 500 G system RAM
        • 4 NVIDIA RTX 6000 Ada GPUs, 48G RAM each
        • 2 TB SSD for fast local I/O, mounted as /ssd1 (not backed up)

      wilkstor01.ccbb.utexas.edu

      • 18 16-TB disks
      • 288 TB raw, 170 TB usable


      Wilke

      Multiple POD group membership

      ...

      Resource | Description | Network availability | For details
      SSH
      • ssh provides remote access to the bash shell's command line on compute servers
      • remote file transfer commands such as scp and rsync to/from compute servers or the shared storage server


      • Standard ssh command unrestricted from the UT campus network (excluding Dell Medical School)
      • Off-campus ssh access:
        • UT VPN service active, or
        • Public key installed in ~/.ssh/authorized_keys
      • Notes:
        • Direct storage server access for file transfers is only available from the UT campus network or with the UT VPN service active.
      Samba
      Allows mounting of Work and Home areas on the shared POD storage server as a remote file system that can be browsed from your Windows or Mac desktop/laptop computer
      • Unrestricted from the UT campus network (excluding Dell Medical School)
      • Off-campus access requires the UT VPN service to be active
      HTTPS
      Access to web-based R Studio server and JupyterHub server applications from any compute server
      • Unrestricted for BRCF-managed accounts
        • For PODs using EID authentication (e.g. Livestrong), an active UT EID is required

      ...

      Compute servers can be accessed via ssh using either their formal BRCF name or their alias. For Mac and Linux users, ssh is available from any Terminal window; for Windows users, it is available from the Windows Command Prompt or PowerShell. Other SSH client programs such as PuTTY (http://www.putty.org/) can also be used.
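      For example, a typical session might look like the sketch below. The user name and file names are placeholders, and gsafcomp01 is used only as an example host; substitute your own BRCF account and your POD's compute server.

      Code Block
      languagebash
      # log in to a POD compute server (hypothetical user; example host from the table above)
      ssh my_username@gsafcomp01.ccbb.utexas.edu

      # copy a local file to your Home directory on the POD (hypothetical file name)
      scp ./my_data.fastq.gz my_username@gsafcomp01.ccbb.utexas.edu:~/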

      ...

      Shared Work areas are backed up weekly. Scratch areas are not backed up. Both Work and Scratch areas may have quotas, depending on the POD (e.g. on the Rental or GSAF pod); such quotas are generally in the multi-terabyte range.

      Because it has a large quota and is regularly backed up and archived, your group's Work area is where large research artifacts that need to be preserved should be located.

      Scratch, on the other hand, can be used for artifacts that are transient or can easily be re-created (such as downloads from public databases).

      See Manage storage areas by project activity for important guidelines for Work and Scratch area contents.

      ...

      Note that any directory in any file system tree named tmp, temp, or backups is not backed up. Directories with these names are intended for temporary files, especially large numbers of small temporary files. See "Cannot create tempfile" error and Avoid having too many small files.

      Periodic and long-term archiving

      Data on the backup server are periodically archived to TACC's Ranch tape archive roughly once a year. Current archives are as of:

      • 2025-01 (in progress)
      • 2024-01 (many but not all PODs)
      • 2022-01
      • 2020-07

      ...

      Remember that PODs are shared resources, and it is important to be aware of how your work can affect others trying to use POD resources. Here are some tips for using POD resources wisely.


      Memory usage considerations

      Using too much RAM can quickly make a compute server unusable. When a system's main random access memory (RAM) is filled and additional memory requests are made, "pages" of main memory will be written out to "swap" space on disk, then read back in when again needed. Since disk I/O is on the order of 1,000 times slower than RAM access, swapping can slow a system down considerably.

      And in a pathological (but unfortunately not uncommon) pattern, a program (or programs) that needs more memory than is available can cause "thrashing", where pages are swapped in and out of RAM continuously. This will bring a computer to its knees, making it virtually impossible to do anything on it (slow logins, or logins timing out; any simple command just "hanging" for a long time or never returning). This situation can also arise when each process a user starts itself spawns multiple threads, as described at Do not run too many processes. We monitor system usage and will intervene when we see this happen, by terminating the offending process(es) if possible, or by rebooting the compute server if not.

      You can avoid causing a problem like this by following this advice:

      Tips:

      • Know the memory configuration of the compute server you're using
        • free -g will show you total RAM and swap in Gigabytes
      • Before starting a memory intensive job, check the system's current memory status
        • free -g also shows used and available for both main memory and swap
      • Know the memory requirements of your program.
        • Monitor the memory usage of one typical process while it is running using top (see https://www.booleanworld.com/guide-linux-top-command/) or htop
        • This is particularly important if you plan to run multiple instances of a program, since it will guide you in knowing how many such instances you should run.
      • Use ulimit -H -m <max_ram> to limit the memory a given process can use.
      • Run memory intensive processes when system load is otherwise light (e.g. overnight)
      • No single user should run programs that use excessive RAM
        • Less than 75% of total RAM if running when system load is otherwise light (e.g. overnight), and the programs are not expected to run for more than a few hours
        • Less than 25% of total RAM otherwise
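
      To put the tips above into practice, the snippet below is a minimal sketch of checking a server's memory before launching a large job. The 100 GB threshold and the job script name are arbitrary illustrations, not BRCF policy.

      Code Block
      languagebash
      # total, used, and available RAM and swap, in gigabytes
      free -g

      # a sketch: only start the big job if at least 100 GB RAM is available
      # (threshold and job command are hypothetical)
      avail_gb=$(free -g | awk '/^Mem:/ {print $7}')
      if [ "$avail_gb" -ge 100 ]; then
          nice -n 15 ./my_memory_intensive_job.sh
      else
          echo "Only ${avail_gb} GB available; wait for a quieter time"
      fi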

      Computational considerations

      This section describes a number of computation-related considerations.

      Multi-processing: cores vs hyperthreads

      Many programs offer an option to divide their work among multiple processes, which can reduce the total clock time the program will run. The option may be called "processes", "cores" or "threads", but all of these actually target the available computing units on a server. Examples include: the samtools sort --threads option; the bowtie2 -p/--threads option; in R, library(doParallel); registerDoParallel(cores = NN); and the OMP_NUM_THREADS environment variable for OpenMP programs.
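      For illustration, here is a minimal sketch of setting an explicit thread count for a few common tools; the input and output file names are placeholders, and the counts should be adjusted to your server and workload.

      Code Block
      languagebash
      # sort a BAM file using 8 threads (file names are hypothetical)
      samtools sort --threads 8 -o sample.sorted.bam sample.bam

      # align with bowtie2 using 8 threads (index and read files are hypothetical)
      bowtie2 -p 8 -x genome_index -U reads.fastq.gz -S sample.sam

      # limit any OpenMP-based program to 4 threads
      export OMP_NUM_THREADS=4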

      A "computing unit" is a server's cores and hyperthreads, and it is important to keep in mind the difference between the two. Cores are physical computing units, while hyperthreads are virtual computing units – kernel objects that "split" each core into two hyperthreads so that the single compute unit can be used by two processes.

      The POD Resources and Access: AvailablePODs table describes the compute servers that are associated with each BRCF pod, along with their available cores and (hyper)threads. (Note that most servers are dual-CPU, meaning that total core count is double the per-CPU core count, so a dual 4-core CPU machine would have 8 cores.) You can also see the hyperthread and core counts on any server via:

      Code Block
      languagebash
      cat /proc/cpuinfo | grep -c 'core id'           # actually the number of hyperthreads!
      cat /proc/cpuinfo | grep 'siblings' | head -1   # the real number of physical cores

      (Yes, the fact that 'core id' gives hyperthreads and 'siblings' the number of cores is confusing. But what do you expect -- this is Unix (smile))
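      As an aside (a suggested alternative, not part of the recipe above), the standard utilities below report the same information more directly:

      Code Block
      languagebash
      nproc    # number of logical CPUs (hyperthreads) available to you
      lscpu    # summary including sockets, cores per socket, and threads per core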

      Since hyperthreads look like available computing units ("CPUs" in OS displays), parallel processing options that detect "cores" usually really detect hyperthreads. Why does this matter?

      The bottom line:

      • Virtual hyperthreads are useful if the work a process is doing periodically "yields", typically to perform input/output operations, since waiting for I/O allows the core to be used for other work. The majority of NGS tools fall into this category since they read/write sequencing and other data files.
      • Physical cores are best used when a program's work is compute-bound. When processing is compute bound -- as is typical of matrix-intensive machine learning algorithms -- hyperthreads actually degrade performance, because two compute-bound hyperthreads are competing for the same physical core, and there is OS-level overhead involved in process switching between the two.

      So before you select a process/core/thread count for your program, consider whether it will perform significant I/O. If so, you can specify a higher count. If it is compute bound (e.g. machine learning), be sure to specify a count low enough to leave free hyperthreads for others to use.

      Note that while you can't specify cores versus hyperthreads specifically, when you request some number of processes/cores/threads, the OS will allocate cores first, then hyperthreads.

      Note also that machine learning (ML) workflows are typically heavily compute bound, which is the main reason ML processing is best run on GPU-enabled servers, either at TACC or on one of the BRCF pods with GPUs (see BRCF GPU servers).


      Do not run too many processes

      Having described how to run multiple processes, it is important that you do not run too many processes at a time, because you are just using one compute server, and you're not the only one using the machine!

      Note that starting a single instance of a program can sometimes spawn many threads. For example, each instance of an OpenMP program will by default use all available threads. To avoid this with OpenMP, set the OMP_NUM_THREADS environment variable (e.g. export OMP_NUM_THREADS=1). However, it is important to check the documentation for the particular program being used, and also use top (press the 1 key to see per-hyperthread load) or htop to see how many threads a single instance of a program uses.

      How many is "too many"? That really depends on what kind of job it is, what compute/input-output mix it has, and how much RAM it needs. As a general rule, don't run more simultaneous jobs on a POD compute server than you would run on a single TACC compute node.

      Before running multiple jobs, you should check RAM usage (free -g will show usage in GB) and see what is already running using the top program (press the 1 key to see per-hyperthread load), or using the who command, or with a command like this:

      Code Block
      languagebash
      ps -ef | grep -v root | grep -v bash | grep -v sshd | grep -v screen | grep -v tmux | grep -v 'www-data'

      Here is a good article on all the aspects of the top command: https://www.booleanworld.com/guide-linux-top-command/

      Finally, be sure to lower the priority of your processes using renice as described below (e.g. renice -n 15 -u `whoami`).

      Lower priority for large, long-running jobs

      If you have one or more jobs that uses multiple threads, or does significant I/O, its execution can affect system responsiveness for other users.

      To help avoid this, please use the renice tool to manipulate the priority of your tasks (a priority of 15 is a good choice). It's easy to do, and here's a quick tutorial: http://www.thegeekstuff.com/2013/08/nice-renice-command-examples/?utm_source=tuicool

      For example, before you start any tasks, you can set the default priority to nice 15 as shown here. Anything you start from then on (from this shell) should inherit the nice 15 value.

      Code Block
      languagebash
      renice +15 $$

      Once you have tasks running, their priority can be changed for all of them by specifying your user name:

      Code Block
      languagebash
      renice +15 -u `whoami`

      or for a particular process id (PID):

      Code Block
      languagebash
      renice +15 -p <some PID number>

      Running processes unattended

      While POD compute servers do not have a batch system, you can still run multiple tasks simultaneously in several different ways. 

      For example, you can use terminal multiplexer tools like screen or tmux to create virtual terminal sessions that won't go away when you log off. Then, inside a screen  or  tmux  session you can create multiple sub-shells where you can run different commands.

      You can also use the command line utility nohup to start processes in the background, again allowing you to log off and still have the process running.

       Here are some links on how to use these tools:
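      As a quick illustration (a minimal sketch; the script name is a placeholder), the two approaches look like this:

      Code Block
      languagebash
      # run a long job with nohup so it survives logout; output goes to pipeline.log
      nohup ./my_long_pipeline.sh > pipeline.log 2>&1 &

      # or run it inside a named screen (or tmux) session you can detach from and re-attach to
      screen -S mypipeline        # start a session (or: tmux new -s mypipeline)
      ./my_long_pipeline.sh       # run the job inside the session
      # detach with Ctrl-A d (screen) or Ctrl-B d (tmux); later re-attach with:
      screen -r mypipeline        # (or: tmux attach -t mypipeline)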

      Input/Output considerations

      Avoid heavy I/O load

      Please be aware of the potential effects of the input/output (I/O) operations in your workflows.

      Many common bioinformatics workflows are largely I/O bound; in other words, they do enough input/output that it is essentially the gating factor in execution time. This is in contrast to simulation or modeling type applications, which are essentially compute bound.

      It is underappreciated that I/O is much more difficult to parallelize than compute. To add more compute power, one can generally just increase the number of processors, their speed, and optimize their CPU-to-memory architecture, which greatly affects compute-bound tasks.

      I/O, on the other hand, is harder to parallelize. Large compute clusters such as TACC expose large single file system namespaces to users (e.g. Work, Scratch), but these are implemented using multiple redundant storage systems managed by a sophisticated parallel file system (Lustre, at TACC) to appear as one. Even so, file system outages at TACC caused by heavy I/O are not uncommon.

      In the POD architecture, all compute servers share a common storage server, whose file system is accessed over a high-bandwidth local network (NFS over 10 Gbit ethernet). This means that heavy I/O to shared storage initiated from any compute server can negatively affect users on all compute servers.

      For example, as few as three simultaneous invocations of gzip or samtools sort on large files can degrade system responsiveness for other users. If you notice that doing an ls or command completion on the command line seems to be taking forever, this can be a sign of an excessive I/O load (although very high compute loads can occasionally cause similar issues).

      To gauge your program's I/O usage:

      1. Run it on smaller datasets first
      2. Check I/O effects by:
        1. running ls /stor/work
          1. If the listing doesn't appear, or appears only after a significant delay, there is too much I/O going on
        2. exercising tab-completion from the command line (see below)
          1. tab completion is directly impacted by I/O load, so if it is slow there's too much I/O going on
      Code Block
      ls /st                   # Typing this + Tab expands to /stor
      ls /stor/sy              # Typing this + Tab expands to /stor/system
      ls /stor/system/o        # Typing this + Tab expands to /stor/system/opt
      ls /stor/system/opt/sam  # Typing this + Tab expands to /stor/system/opt/samtools (not uniquely)
      
      # Typing this + Tab twice will list many possible completions:
      ls /stor/system/opt/samtools/bam


      Reduce the I/O priority of your processes

      Similar to the way renice reduces the CPU priority of your processes (see above), ionice can reduce the I/O priority. This can be done for all your processes or for specific ones:

      Code Block
      # lower I/O priority for process number <pid>
      ionice -c 2 -n 7 -p <pid>
      
      # lower I/O priority for all your processes
      ionice -c 2 -n 7 -u <uid>
      
      # and here's how to find your <uid> (user ID)
      grep $USER /etc/passwd | awk -F ':' '{print $3}'

      Transfer large files directly to the storage server

      BRCF storage servers are just Linux servers, but ones you access from compute servers over a high-speed internal network. While they are not available for interactive shell (ssh) access, they do provide direct file transfer capability via scp or rsync.

      Using the storage server as a file transfer target is useful when you have many files and/or large files, as it provides direct access to the shared storage. Going through a compute server is also possible, but involves an extra step in the path – from the compute server to its network-attached storage server.

      The solution is to target your POD's storage server directly using scp or rsync. When you do this, you are going directly to where the data is physically located, so you avoid extra network hops and do not burden heavily-used compute servers.
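      For example (a sketch only; the user name, group directory, and file names are placeholders, and gsafstor01 is used only as an example storage server -- substitute your own POD's storage server):

      Code Block
      languagebash
      # copy a large file directly to your group's Work area on the storage server
      scp ./big_dataset.tar.gz my_username@gsafstor01.ccbb.utexas.edu:/stor/work/MyGroup/

      # or use rsync, which can resume interrupted transfers
      rsync -avP ./project_dir/ my_username@gsafstor01.ccbb.utexas.edu:/stor/work/MyGroup/project_dir/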

      Tip

      Note that direct storage server file transfer access is only available from UT network addresses, from TACC, or using the UT VPN service.

      ...


      Please see this FAQ for more information: I'm having trouble transferring files to/from TACC.

      Storage management considerations

      Manage storage areas by project activity

      Shared POD storage servers are high capacity (~50 to ~250 TB), but space is not infinite! The same goes for backup storage, since the BRCF must have capacity to back up all POD Home and Work areas. The following guidelines will help you and your colleagues stay within storage limits.

      There are several types of data activity that determine where the data should reside:

      • Data that is active, such as project directories where new files are added and ongoing analysis is taking place.
        • This data belongs in your Work area where it is regularly backed up.
      • Data that is no longer active (or is active but read-only) but needs to be accessible for reference, and needs to be preserved.
        • E.g. projects that are complete but that you refer to from time to time.
        • This data belongs in your Scratch area so that it does not consume backup space.
        • Please contact us at rctf-support@utexas.edu to request that a long-term archive of the data be made to tape.
          • We can also efficiently move the data from Work to Scratch for you since we can access the storage server directly.
      • Data that is no longer active and does not need to be referenced, but needs to be preserved.
        • Contact us at rctf-support@utexas.edu to request a long-term tape archive copy; the data can then be removed entirely so that it does not consume either storage server or backup server space.
      • External/public data or downloaded software that needs to be accessible but does not need to be backed up or preserved.
        • This data always belongs in Scratch since it can be re-downloaded if necessary.
      • Data that is no longer active, does not need to be referenced, and does not need to be preserved.
        • You can delete this data yourself, or contact us to remove the data for you (we can do this efficiently since we can access the storage server directly).

      This table summarizes these guidelines.

      # | active? | external? | needs to be accessible? | needs to be preserved? | examples | process/actions
      1 | yes | no | yes | TBD
      • examples: current project & analysis directories that are read and written
      • actions: store in regularly backed-up Work area
      2 | no | no | yes | yes
      • examples: no-longer-active projects that still need to be referenced; read-only data such as FASTQ or other instrument-generated files
      • actions: store in Scratch area; contact rctf-support@utexas.edu to create a tape archive copy and to move the directories from Work to Scratch for you
      3 | no | no | no | yes
      • examples: no-longer-active projects that do not need to be readily accessible
      • actions: contact rctf-support@utexas.edu to create a tape archive copy for you, then remove the data (either from Work or Scratch)
      4 | yes | yes | yes | no
      • examples: data and annotations from public databases; downloaded software
      • actions: always store in Scratch area, since this is external data that can be re-downloaded if necessary
      5 | no | yes or no | no | no
      • examples: abandoned projects; external data or software that is no longer needed
      • actions: delete the data yourself, or contact rctf-support@utexas.edu to remove it for you

      Periodic Work area storage management

      Note that in order to manage our backup server storage efficiently, we periodically examine storage server Work area directories and move older data to Scratch as follows:

      • The Work area directory is transferred to a corresponding area under /stor/scratch/<group_name>/archive.
        • And the Work area directory is replaced with a symbolic link to the Scratch directory
      • The directory contents are archived to TACC's ranch tape archive system, so there is a "Long Term Archive" (LTA) copy.
        • All data under /stor/scratch/<group_name>/archive has been archived to ranch.
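
      As a concrete sketch of what this looks like afterwards (the group and project directory names are hypothetical):

      Code Block
      languagebash
      # the original Work directory is now a symbolic link pointing into Scratch
      ls -l /stor/work/MyGroup/old_project
      # output will show something like:
      #   /stor/work/MyGroup/old_project -> /stor/scratch/MyGroup/archive/old_project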

      Avoid having too many small files

      While the ZFS file system we use is quite robust, we can experience issues in the weekly backup and periodic archiving process when there are too many small files in a directory tree.

      What is too many? Ten million or more.

      If the files are small, they don't take up much storage space. But the fact that there are so many causes the backup or archiving to run for a really long time. For weekly backups, this can mean that the previous week's backup is not done by the time the next one starts. For archiving, it means it can take weeks on end to archive a single directory that has many millions of small files.

      Backing up gets even worse when a directory with many files is just moved or renamed. In this case the files need to be deleted from the old location and added to the new one – and both of these operations can be extremely long-running.

      To see how many files (termed "inodes" in Unix) there are under a directory tree, use the df -i command. For example:

      Code Block
      languagebash
      df -i /stor/work/MyGroup/my_dir

      The results might look something like this:

      Code Block
      languagebash
      Filesystem               Inodes     IUsed        IFree IUse% Mounted on
      stor/work/MyGroup  103335902213  28864562 103307037651    1% /stor/work/MyGroup

      The IUsed column (here 28864562) is the number of inodes (files plus directories) in the directory tree listed under Filesystem (here /stor/work/MyGroup). Note that the reported Filesystem may be different from the one you queried, depending on the structure of the ZFS file systems.
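      If you want a count for one specific subtree rather than the whole ZFS file system, a simple (if slower) alternative is shown below; the path is a placeholder.

      Code Block
      languagebash
      # count files and directories under a single directory tree (can be slow for huge trees)
      find /stor/work/MyGroup/my_dir | wc -l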

      There are several work-arounds for this issue.

      1) Move the files to a temporary directory.
      The backup process excludes any sub-directory anywhere in the file system directory tree named tmp, temp, or backups. So if there are files you don't care about, just rename the directory to, for example, tmp. There will be a one-time deletion of the directory under its previous name, but that would be it. 

      2) Move the directories to a Scratch area.
      Scratch areas are not backed up, so will not cause an issue. The directory can be accessed from your Work area via a symbolic link. Please Contact Us if you would like us to help move large directories of yours to Scratch (we can do it more efficiently with our direct access to the storage server).

      3) Zip or tar the directory
      If these are important files you need to have backed up, zipping or tarring the directory is the way to go. This converts a directory and all its contents into a single, larger file that can be backed up or archived efficiently. Please Contact Us if you would like us to help with this, since with our direct access to the storage server we can perform zip and tar operations much more efficiently than you can from a compute server.

      If your analysis pipeline creates many small files as a matter of course, you should consider modifying the processing to create the small files in a tmp directory, then zipping or tarring them as a final step.
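      For example, a minimal sketch of bundling a directory of many small files into a single archive (the directory and archive names are hypothetical):

      Code Block
      languagebash
      # create a single compressed archive from a directory of many small files
      tar czf many_small_files.tar.gz many_small_files/

      # once the archive has been verified, remove the original directory to free inodes
      # tar tzf many_small_files.tar.gz > /dev/null && rm -rf many_small_files/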

      Other available POD services

      ...