NVIDIA GPU servers
Three BRCF research pods have NVIDIA GPU servers; however, their use is restricted to the groups that own those pods.
- 1 Servers
- 1.1 Hopefog pod
- 1.2 Wilke pod
- 1.3 Marcotte/Gilpin pod
- 2 GPU-enabled software
- 2.1 AlphaFold2
- 2.2 TensorFlow and PyTorch
- 2.3 GROMACS
- 3 Resources
- 3.1 CUDA
- 3.2 Command-line diagnostics
- 3.3 Sharing resources
Servers
Hopefog pod
hfogcomp04.ccbb.utexas.edu compute server on the Hopefog pod (Ellington/Marcotte):
Dell PowerEdge R750XA
dual 24-core/48-thread CPUs (48 cores, 96 hyperthreads total)
512 GB system RAM
2 NVIDIA Ampere A100 GPUs with 80 GB onboard RAM each
1.8 TB NVMe drive mounted as /NVMe1 for fast local I/O – not backed up
hfogcomp05.ccbb.utexas.edu
GIGABYTE MC62-G40-00
32-core/64-thread AMD Ryzen CPU
512 GB system RAM
4 NVIDIA RTX 6000 Ada GPUs, 48 GB RAM each
1 TB NVMe drive mounted as /NVMe1 for fast local I/O – not backed up
Wilke pod
wilkcomp03.ccbb.utexas.edu compute server
GIGABYTE MC62-G40-00 workstation
AMD Ryzen 5975WX CPU (32 cores, 64 hyperthreads total)
512 GB system RAM
4 NVIDIA RTX 6000 Ada GPUs
14 TB NVMe drive mounted as /ssd1 for fast local I/O – not backed up
Marcotte/Gilpin pod
gilpcomp01.ccbb.utexas.edu compute server
ThinkMate GPU server
dual AMD EPYC 9654 96-core CPUs
768 GB system RAM
4 NVIDIA GH100 GPUs, 96 GB RAM each
4 NVMe drives for fast local I/O – not backed up
/ssd1, /ssd2 – 13 TB, OS is on separate mirrored partitions
/ssd3, /ssd4 - 14 TB
GPU-enabled software
AlphaFold2
The AlphaFold2 protein structure prediction software is available on all NVIDIA GPU servers. The /stor/scratch/AlphaFold directory contains the large required database under its data.3 sub-directory, along with an NVIDIA example script, alphafold_example_nvidia.sh.
TensorFlow and PyTorch
Two Python example scripts are located in /stor/scratch/GPU_info that can be used to ensure you have access to the server's GPUs from TensorFlow or PyTorch. Run them from the command line like this:
TensorFlow
python3 /stor/scratch/GPU_info/tensorflow_example.py
PyTorch
python3 /stor/scratch/GPU_info/pytorch_example.py
If GPUs are available and accessible, the output generated will indicate they are being used.
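For a quick inline check without the example scripts, the same test can be run directly from the shell. This is a minimal sketch; if a framework is not installed, the corresponding line just prints a note:

```shell
# Inline GPU-visibility checks; each line prints a device count,
# or a note if that framework is not importable from python3.
python3 -c "import torch; print('PyTorch GPUs:', torch.cuda.device_count())" \
    2>/dev/null || echo "PyTorch not importable"
python3 -c "import tensorflow as tf; print('TensorFlow GPUs:', len(tf.config.list_physical_devices('GPU')))" \
    2>/dev/null || echo "TensorFlow not importable"
```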
Note that our system-wide CUDA-enabled TensorFlow and PyTorch versions are available in both the default Python 3 command-line environment (e.g. python3 or python3.12 on the command line) and also in the global JupyterHub environment that uses the Python 3.12 kernel. If you need a different combination of Python and TensorFlow/PyTorch versions, you'll need to construct an appropriate custom conda environment (e.g. miniconda3 or anaconda).
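Such a custom environment might be created along these lines. The environment name and versions below are illustrative assumptions, not site recommendations:

```shell
# Sketch: build a custom conda environment with a specific Python and
# framework combination; substitute the versions your workflow needs.
if command -v conda >/dev/null 2>&1; then
    conda create -y -n my-gpu-env python=3.10
    conda run -n my-gpu-env pip install torch      # or tensorflow
    conda run -n my-gpu-env python -c "import torch; print(torch.__version__)"
else
    echo "conda not found; install miniconda3 or anaconda first"
fi
```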
GROMACS
An NVIDIA GPU-enabled build of the GROMACS Molecular Dynamics (MD) program is available on all NVIDIA GPU servers; a CPU-only build is also installed.
The /stor/scratch/GROMACS directory has several useful resources:
benchmarks/ - a set of MD benchmark files from https://www.mpinat.mpg.de/grubmueller/bench
gromacs_nvidia_example.sh - a simple GROMACS example script taking advantage of the GPU, running the benchMEM.tpr benchmark by default.
gromacs_cpu_example.sh - a GROMACS example script using the CPUs only.
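For orientation, a GPU-offloaded GROMACS run along the lines of those scripts might look like the sketch below. The benchmark path and thread count are assumptions; consult gromacs_nvidia_example.sh for the invocation actually used on these servers:

```shell
# Sketch: run the benchMEM benchmark with nonbonded interactions
# offloaded to the GPU (path and thread count are assumptions).
TPR=/stor/scratch/GROMACS/benchmarks/benchMEM.tpr
if command -v gmx >/dev/null 2>&1; then
    # -nb gpu offloads nonbonded work; -ntomp sets OpenMP thread count
    gmx mdrun -s "$TPR" -nb gpu -ntomp 16 -deffnm benchMEM_gpu
else
    echo "gmx not found; is GROMACS on your PATH?"
fi
```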
Resources
CUDA
hfogcomp04 and wilkcomp03 each have both CUDA 11.8 and CUDA 12.9 installed, under version-specific subdirectories of /usr/local.
To ensure CUDA 11 is made active:
export CUDA_HOME=/usr/local/cuda-11.8
export PATH=$CUDA_HOME/bin:$PATH
To ensure CUDA 12 is made active:
export CUDA_HOME=/usr/local/cuda-12
export PATH=$CUDA_HOME/bin:$PATH
Neither version is active by default, and some (but not all) programs rely on these environment variables, so you should activate one or the other before running software that uses the GPUs.
After setting these environment variables, type nvcc --version to ensure you have access to the desired version.
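Putting those steps together for CUDA 11.8, a session might look like this (the nvcc check degrades gracefully if the toolkit directory is missing on a given host):

```shell
# Pin CUDA 11.8 for this shell session, then confirm which nvcc is active.
# Substitute /usr/local/cuda-12 to select CUDA 12 instead.
export CUDA_HOME=/usr/local/cuda-11.8
export PATH="$CUDA_HOME/bin:$PATH"
echo "CUDA_HOME=$CUDA_HOME"
if command -v nvcc >/dev/null 2>&1; then
    nvcc --version
else
    echo "nvcc not found on PATH (is $CUDA_HOME present on this host?)"
fi
```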
CUDA drivers are installed under /usr/lib/x86_64-linux-gnu/. To see what version is currently installed:
ls /usr/lib/x86_64-linux-gnu/libnvidia-gl*
See https://saturncloud.io/blog/where-did-cuda-get-installed-in-my-computer/ for more background.
Command-line diagnostics
Use nvidia-smi to verify access to the server's GPUs and to monitor GPU usage.
Sharing resources
Since there's no batch system on BRCF POD compute servers, it is important for users to monitor their resource usage and that of other users in order to share resources appropriately.
Use top to monitor running tasks (or top -i to exclude idle processes)
Interactive commands while top is running include:
M - sort task list by memory usage
P - sort task list by processor usage
N - sort task list by process ID (PID)
T - sort task list by run time
1 - show usage of each individual hyperthread
they're called "CPUs" but are really hyperthreads
this list can be long; non-interactive mpstat may be preferred
Use mpstat to monitor overall CPU usage
mpstat -P ALL to see usage for all hyperthreads
mpstat -P 0 to see specific hyperthread usage
Use free -g to monitor overall RAM memory and swap space usage (in GB)
Use nvidia-smi to monitor GPU usage
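The checks above can be combined into a one-shot snapshot, sketched below. mpstat requires the sysstat package, and nvidia-smi is only present on the GPU servers, so both are guarded:

```shell
# One-shot resource snapshot: CPU, RAM, and GPU usage.
echo "=== CPU usage (all hyperthreads, 1-second sample) ==="
command -v mpstat >/dev/null 2>&1 && mpstat -P ALL 1 1 || echo "mpstat not installed"
echo "=== Memory and swap (GB) ==="
free -g
echo "=== GPU usage ==="
command -v nvidia-smi >/dev/null 2>&1 && nvidia-smi || echo "nvidia-smi not found"
```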