Austin's own Advanced Micro Devices (AMD) has most generously donated a number of GPU-enabled servers to UT.

While it is still true that AMD GPUs do not support as many 3rd party applications as NVIDIA, they do support many popular Machine Learning (ML) applications such as TensorFlow, PyTorch, and AlphaFold, and Molecular Dynamics (MD) applications such as GROMACS, all of which are installed and ready for use.

Two BRCF research pods have AMD GPU servers available: the Hopefog and Livestrong pods. Their use is restricted to the groups who own those pods. See Livestrong and Hopefog pod AMD servers for specific information.

The BRCF's AMD GPU pod is available for instructional use and for research use for qualifying UT-Austin affiliated PIs. Allocations are granted to groups who will only perform certain GPU-enabled workflows, not for general computation. To request an allocation, contact us at rctf-support@utexas.edu, and provide the UT EIDs of those who should be granted access.

The ROCm framework

ROCm is AMD's equivalent to the CUDA framework. ROCm is open source, while CUDA is proprietary.

ROCm versions and GPU type

We have multiple versions of the ROCm framework installed in the /opt directory, designated by a version number extension (e.g. /opt/rocm-5.7.2, /opt/rocm-5.2.3). The default version is the one pointed to by the /opt/rocm symbolic link, which is generally the latest version supported on the specific server.
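One way to see which versions are installed and which one the symbolic link currently selects is sketched below. The function name list_rocm_versions and the ROCM_BASE override are illustrative assumptions, not part of the pod setup; the only thing taken from this page is the /opt/rocm-<version> layout and the /opt/rocm link.

```shell
# List installed ROCm versions and report which one is the default.
# list_rocm_versions and ROCM_BASE are hypothetical names used for illustration;
# the base directory defaults to /opt as described above.
list_rocm_versions() {
    local base="${1:-${ROCM_BASE:-/opt}}"
    local dir
    for dir in "$base"/rocm-*; do
        # Skip the unexpanded pattern when nothing matches
        [ -d "$dir" ] || continue
        basename "$dir"
    done
    # The default version is whatever the rocm symbolic link points to
    if [ -L "$base/rocm" ]; then
        echo "default -> $(basename "$(readlink -f "$base/rocm")")"
    fi
    return 0
}

list_rocm_versions
```

On a server without any ROCm installation the function simply prints nothing.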

Livestrong and Hopefog pod AMD GPU servers have MI-50 GPUs (livecomp02/03, hfogcomp02/03), which are now End-Of-Life (no longer supported) according to AMD. As of May 2024, the highest ROCm version supported for the MI-50 GPUs is rocm-5.7.2, which is the last minor version in the ROCm 5.x series. ROCm 5.7.2 is the default for these MI-50 servers, but a lower ROCm version may be selected.

AMD GPU pod servers have MI-100 GPUs (amdgcomp01/02/03), which support the newer ROCm 6.x series. As of July 2025, the ROCm default for these MI-100 servers is rocm-6.3.1.

Changing ROCm versions

To specify a particular ROCm version other than the default, set the ROCM_PATH environment variable; for example:

Code Block
export ROCM_PATH=/opt/rocm-5.1.3

You may also need to adjust your LD_LIBRARY_PATH as follows:

Code Block
export LD_LIBRARY_PATH="/opt/rocm-5.1.3/hip/lib:$LD_LIBRARY_PATH"
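If you switch versions often, the two exports can be wrapped in a small shell function that first checks the requested version is actually installed. This is only a sketch: use_rocm and ROCM_BASE are hypothetical names, and the hip/lib sub-directory is assumed from the example above.

```shell
# Point the current shell at a specific installed ROCm version.
# use_rocm and ROCM_BASE are hypothetical names used for illustration only.
use_rocm() {
    local version="$1"
    local root="${ROCM_BASE:-/opt}/rocm-$version"
    if [ ! -d "$root" ]; then
        echo "ROCm $version is not installed under ${ROCM_BASE:-/opt}" >&2
        return 1
    fi
    export ROCM_PATH="$root"
    # Prepend the HIP libraries, keeping any existing entries
    export LD_LIBRARY_PATH="$root/hip/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
}

# Example (run in your login shell or job script):
#   use_rocm 5.1.3
```

The settings only last for the current shell session; a new login goes back to the default pointed to by /opt/rocm.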

GPU-enabled software

AlphaFold

The AlphaFold2 protein structure solving software is available on all AMD GPU servers.

The /stor/scratch/AlphaFold directory has the large required database, under the data.4 sub-directory. There is also an AMD example script, /stor/scratch/AlphaFold/alphafold_example_amd.sh, and an alphafold_example_nvidia.sh script if the pod also has NVIDIA GPUs (e.g. the Hopefog pod).

On AMD GPU servers, AlphaFold is implemented by a run_alphafold.py Python script inside a Docker image. See the run_alphafold_rocm.sh and run_multimer_rocm.sh scripts under /stor/scratch/AlphaFold for a complete list of options to that script.

AlphaFold requires a number of databases in order to run, and several versions of these databases can be found under /stor/scratch/AlphaFold/:

  • data.1, data.2, data.3, data.4 – data.4 is the default, but this can be changed in the run_*rocm.sh scripts

GROMACS

An AMD GPU-enabled version of the Molecular Dynamics (MD) program GROMACS is available on all AMD GPU servers, and a CPU-only version is also installed.

The /stor/scratch/GROMACS directory has several useful resources:

  • benchmarks/ - a set of MD benchmark files from https://www.mpinat.mpg.de/grubmueller/bench
  • gromacs_amd_example.sh - a GROMACS example script taking advantage of the GPU, running the benchMEM.tpr benchmark by default.
  • gromacs_cpu_example.sh - a GROMACS example script using the CPUs only.

PyTorch and TensorFlow

All pod compute servers have 3 main Python environments, which are all managed separately (see About Python and JupyterHub server for more information about these environments):

  • command-line Python 2.7 (python2.7, pip2.7)
  • command-line Python 3.12 (python, python3, python3.12, pip, pip3, pip3.12)
  • web-based JupyterHub which uses the Python 3.12 kernel

The status of AMD-GPU-enabled versions of TensorFlow and PyTorch in each environment is as follows:

GPU-enabled PyTorch (all AMD GPU servers)
  • command-line python3, python3.12
GPU-enabled TensorFlow
  • AMD GPU pod
    • command-line python3, python3.12
  • Livestrong and Hopefog pods
    • command-line only; must do this first:
      source /stor/scratch/GPU_info/activate_tensorflow_amd_conda
    • AMD-GPU-enabled TensorFlow in JupyterHub is not supported
      • JupyterHub runs the standard CPU-only TensorFlow

PyTorch/TensorFlow example scripts

Two Python scripts are located in /stor/scratch/GPU_info that can be used to ensure you have access to the server's GPUs from TensorFlow or PyTorch. You can run them from the command line using time to see the run times.

  • TensorFlow – AMD GPU pod servers (amdgcomp01/02/03)
    • time (python3 /stor/scratch/GPU_info/tensorflow_example.py )
  • TensorFlow – Livestrong and Hopefog pod servers
    • time (python3 /stor/scratch/GPU_info/tensorflow_example.py )
  • PyTorch
    • time (python3 /stor/scratch/GPU_info/pytorch_example.py )
    • Model time should be ~30-45s with GPU on an unloaded system
    • You'll see this warning, which can be ignored:
      MIOpen(HIP): Warning [SQLiteBase] Missing system database file: gfx90878.kdb Performance
      may degrade. Please follow instructions to install:
      https://github.com/ROCmSoftwarePlatform/MIOpen#installing-miopen-kernels-package

If GPUs are available and accessible, the output generated will indicate they are being used.

About TensorFlow versions

The AMD-GPU-specific version of TensorFlow, tensorflow-rocm, is installed on all AMD GPU servers. However, while the MI-100 GPU servers on the AMD GPU pod support the latest tensorflow-rocm version (which requires ROCm 6.3), the MI-50 GPU servers on the Livestrong and Hopefog pods require an older version, tensorflow-rocm 2.13.0.570. Unfortunately, the older version does not support Python versions higher than Python 3.10, and both our command-line Python and JupyterHub use Python 3.12. This is why AMD-GPU-enabled TensorFlow is not supported in JupyterHub on the Livestrong and Hopefog pod MI-50 AMD GPU servers.

On Livestrong and Hopefog pod MI-50 AMD GPU servers, the older version of TensorFlow can still be run from the command line, but it is provided through a globally accessible miniconda3 environment. So on these MI-50 servers, you must activate the TensorFlow AMD conda environment first, before running any TensorFlow code:

Code Block
source /stor/scratch/GPU_info/activate_tensorflow_amd_conda

Once you're done running TensorFlow code, use conda deactivate to exit the conda environment.

On the AMD GPU pod, you can install your own local version of tensorflow-rocm with pip3, e.g.:

Code Block
pip3 install tensorflow-rocm==2.15.1

See https://rocm.docs.amd.com/projects/install-on-linux/en/latest/install/3rd-party/tensorflow-install.html for a table showing TensorFlow/ROCm version compatibilities.

If your custom TensorFlow version requires a different ROCm version, you will need to adjust your ROCM_PATH and LD_LIBRARY_PATH, e.g.:

Code Block
export ROCM_PATH=/opt/rocm-6.2.1
export LD_LIBRARY_PATH="/opt/rocm-6.2.1/hip/lib:$LD_LIBRARY_PATH"

PyTorch/TensorFlow example scripts

Two Python scripts are located in /stor/scratch/GPU_info that can be used to ensure you have access to the server's GPUs from TensorFlow or PyTorch. You can run them from the command line as shown below:

  • TensorFlow – AMD GPU pod servers (amdgcomp01/02/03)
    • python3 /stor/scratch/GPU_info/tensorflow_example.py
  • TensorFlow – Livestrong and Hopefog pod servers (livecomp02/03, hfogcomp02/03)
    • bash /stor/scratch/GPU_info/tensorflow_example.amd-mi50.sh 
  • PyTorch – all compute servers
    • python3 /stor/scratch/GPU_info/pytorch_example.py

If GPUs are available and accessible, the output generated should indicate they are being used.

Resources

Command-line diagnostics

  • GPU usage: rocm-smi
  • CPU and GPU details: rocminfo
  • What ROCm modules are installed: dpkg -l | grep rocm
  • GPU ↔ GPU/CPU communication bandwidth test
    • between GPU2 and CPU: rocm-bandwidth-test -b2,0
    • between GPU3 and GPU4: rocm-bandwidth-test -b3,4

...

Since there's no batch system on BRCF POD compute servers, it is important for users to monitor their resource usage and that of other users in order to share resources appropriately.

  • Use rocm-smi to see GPU usage
  • Use top to monitor running tasks (or top -i to exclude idle processes)
    • commands while top is running include:
    • M - sort task list by memory usage
    • P - sort task list by processor usage
    • N - sort task list by process ID (PID)
    • T - sort task list by run time
    • 1 - show usage of each individual hyperthread
      • they're called "CPUs" but are really hyperthreads
      • this list can be long; non-interactive mpstat may be preferred
  • htop is another popular program for monitoring running processes
  • Use mpstat to monitor overall CPU usage
    • mpstat -P ALL to see usage for all hyperthreads
    • mpstat -P 0 to see specific hyperthread usage
  • Use free -g to monitor overall RAM and swap space usage (in GB)
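The checks listed above can be bundled into a quick snapshot to run before starting a large job. This is a minimal sketch: run_if_present and pod_snapshot are hypothetical helper names, and any tool that is not installed on the server is skipped rather than treated as an error.

```shell
# Run a command only when its executable is on the PATH.
# run_if_present and pod_snapshot are hypothetical helper names.
run_if_present() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "== $* =="
        "$@" || echo "(exited with status $?)"
    else
        echo "== $1 not found, skipping =="
    fi
}

# Print a one-time snapshot of GPU, CPU, and memory usage.
pod_snapshot() {
    run_if_present rocm-smi   # GPU usage
    run_if_present mpstat     # overall CPU usage
    run_if_present free -g    # RAM and swap usage, in GB
    return 0
}

pod_snapshot
```

If the GPUs or CPUs already look busy, consider waiting or coordinating with the other users before launching your job.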

AMD GPU and ROCm resources

ROCm GPU-enabling framework

Best starting places:

Training Guides

  1. Introduction_to_AMD_7002_processor.pdf
  2. Radeon_Instinct_HPC_Training_2020.pdf
  3. Radeon_Instinct_ML_Training_2020.pdf

...