Austin's own Advanced Micro Devices (AMD) has most generously donated a number of GPU-enabled servers to UT.
While it is still true that AMD GPUs do not support as many third-party applications as NVIDIA GPUs, they do support many popular Machine Learning (ML) applications, such as TensorFlow, PyTorch, and AlphaFold, and Molecular Dynamics (MD) applications such as GROMACS, all of which are installed and ready for use.
Our recently announced AMD GPU pod is available for instructional use and for research use for qualifying UT-Austin affiliated PIs. To request an allocation, contact us at rctf-support@utexas.edu, and provide the UT EIDs of those who should be granted access.
Two BRCF research pods also have AMD GPU servers available: the Hopefog and Livestrong pods. Their use is restricted to the groups that own those pods. See Livestrong and Hopefog pod AMD servers for specific information.
The AlphaFold protein structure prediction software is available on all AMD GPU servers. The /stor/scratch/AlphaFold directory contains the large required database, under the data.4 sub-directory. There is also an AMD example script, /stor/scratch/AlphaFold/alphafold_example_amd.sh, and an alphafold_example_nvidia.sh script if the pod also has NVIDIA GPUs (e.g. the Hopefog pod). Interestingly, our timing tests indicate that AlphaFold performance is quite similar on all the AMD and NVIDIA GPU servers.
On AMD GPU servers, AlphaFold is implemented by a run_alphafold.py Python script inside a Docker image. See the run_alphafold_rocm.sh and run_multimer_rocm.sh scripts under /stor/scratch/AlphaFold for a complete list of options to that script.
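For example, a minimal way to get started from the command line (a sketch using only the paths listed above; check the example script's header comments for any required arguments before running it):

# see the full set of run_alphafold.py options accepted by the wrapper
less /stor/scratch/AlphaFold/run_alphafold_rocm.sh
# run the AMD example under time, for comparison with other servers
time bash /stor/scratch/AlphaFold/alphafold_example_amd.sh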
Two Python scripts located in /stor/scratch/GPU_info can be used to verify that you have access to the server's GPUs from TensorFlow or PyTorch. Run them from the command line under time to compare run times.
Running them may produce a MIOpen warning like this:

MIOpen(HIP): Warning [SQLiteBase] Missing system database file: gfx90878.kdb Performance may degrade. Please follow instructions to install:
https://github.com/ROCmSoftwarePlatform/MIOpen#installing-miopen-kernels-package

If GPUs are available and accessible, the output generated will indicate they are being used.
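For a quick check outside those scripts, one-liners like these list the visible GPUs (a sketch using standard TensorFlow and PyTorch calls; note that ROCm builds of PyTorch expose AMD GPUs through the torch.cuda API):

# list GPUs visible to TensorFlow
python3 -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"
# ROCm PyTorch reports its GPUs through the torch.cuda interface
python3 -c "import torch; print(torch.cuda.is_available(), torch.cuda.device_count())"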
All pod compute servers have 3 main Python environments, which are all managed separately (see About Python and JupyterHub server for more information about these environments):
We are working hard to get AMD-GPU-enabled versions of TensorFlow and PyTorch working in all three environments. Current status is as follows:
| POD | AMD-GPU-enabled PyTorch | AMD-GPU-enabled TensorFlow |
|---|---|---|
| AMD GPU | | |
| Hopefog | | |
| Livestrong | | |
If you need a different combination of Python and TensorFlow/PyTorch versions, you'll need to construct an appropriate custom Conda environment (e.g. miniconda3 or anaconda) as well as your own Jupyter Notebook environment if needed.
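A minimal sketch of such a setup (the environment name and version numbers here are illustrative only):

conda create -n my-ml python=3.9
conda activate my-ml
pip install tensorflow-rocm==2.9.1
# to use the environment from Jupyter, register it as a kernel
pip install ipykernel
python -m ipykernel install --user --name my-ml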
The AMD-GPU-specific version of TensorFlow, tensorflow-rocm 2.9.1, is installed on most AMD GPU servers. This version works with ROCm 5.1.3+. If you need to install your own version with pip, specify the version explicitly, e.g.:

pip install tensorflow-rocm==2.9.1
You may also need to adjust your LD_LIBRARY_PATH as follows:
export LD_LIBRARY_PATH="/opt/rocm-5.1.3/hip/lib:$LD_LIBRARY_PATH"
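After installing, you can confirm that the ROCm build of TensorFlow loads and sees the GPUs (a quick sanity check, not specific to these servers):

python3 -c "import tensorflow as tf; print(tf.__version__); print(tf.config.list_physical_devices('GPU'))"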
An AMD GPU-enabled version of the Molecular Dynamics (MD) program GROMACS is available on all AMD GPU servers, and a CPU-only version is also installed.
The /stor/scratch/GROMACS directory has several useful resources:
You'll see warnings like these when you run the GPU-enabled example scripts; they can be ignored:

beignet-opencl-icd: no supported GPU found, this is probably the wrong opencl-icd package for this hardware (If you have multiple ICDs installed and OpenCL works, you can ignore this message)
...
libibverbs: Warning: couldn't load driver 'libefa-rdmav34.so': libefa-rdmav34.so: cannot open shared object file: No such file or directory
libibverbs: Warning: couldn't load driver 'libbnxt_re-rdmav34.so': libbnxt_re-rdmav34.so: cannot open shared object file: No such file or directory
...
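As an illustration, a GPU-offloaded GROMACS run looks like this (a sketch; the md_run input name is a hypothetical placeholder, and the scripts in /stor/scratch/GROMACS show the exact invocations used there):

# offload non-bonded interactions to the GPU; expects an md_run.tpr input file
gmx mdrun -deffnm md_run -nb gpu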
ROCm is AMD's equivalent to the CUDA framework. ROCm is open source, while CUDA is proprietary.
We have multiple versions of the ROCm framework installed in the /opt directory, designated by a version number extension (e.g. /opt/rocm-5.1.3, /opt/rocm-5.2.3). The default version is the one pointed to by the /opt/rocm symbolic link, which is generally the latest version.
As of May 2024, the highest ROCm version installed (and the default) is rocm-5.7.2. This is the last minor version in the ROCm 5.x series. ROCm series 6.x versions have now been published, but we do not yet have them installed on the AMD compute servers.
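To see what is installed on a given server (standard ROCm commands; rocminfo ships with ROCm):

ls -d /opt/rocm*                  # all installed ROCm versions
readlink -f /opt/rocm             # the version the default symlink points to
/opt/rocm/bin/rocminfo | head -20 # confirms the runtime can see the GPU agents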
To specify a particular ROCm version other than the default, set the ROCM_HOME environment variable; for example:
export ROCM_HOME=/opt/rocm-5.1.3
You may also need to adjust your LD_LIBRARY_PATH as follows:
export LD_LIBRARY_PATH="/opt/rocm-5.1.3/hip/lib:$LD_LIBRARY_PATH"
Since there's no batch system on BRCF POD compute servers, it is important for users to monitor their resource usage and that of other users in order to share resources appropriately.
Best starting places:
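For example, these general-purpose commands (not pod-specific) cover CPU, memory, and AMD GPU usage:

top        # per-process CPU and memory usage
rocm-smi   # AMD GPU utilization, memory, and temperature (ROCm's analog of nvidia-smi)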