NVIDIA and Red Hat: Simplifying NVIDIA GPU Driver Deployment on Red Hat Enterprise Linux

By Pramod Ramarao, Senior Product Manager, NVIDIA

NVIDIA GPUs are transforming enterprise computing by accelerating workloads that range from inference and data science to large-scale AI training and VDI. Red Hat and NVIDIA have worked together for over 10 years to accelerate Red Hat Enterprise Linux (RHEL) workloads on NVIDIA GPU-enabled servers across the data center, virtualized environments, and the cloud. To serve these diverse enterprise use cases on RHEL, NVIDIA provides a software stack powered by the CUDA platform (drivers, CUDA-X acceleration libraries, and CUDA-optimized applications and frameworks). The NVIDIA / Red Hat partnership continues to grow, with integration efforts across Red Hat’s and NVIDIA’s product portfolios on projects as diverse as video drivers, heterogeneous memory management (HMM), KVM support for virtual GPUs, and Kubernetes.

Based on feedback from our users, NVIDIA and Red Hat have worked closely to improve the user experience when installing and updating NVIDIA software on RHEL, including GPU drivers and CUDA.

NVIDIA and Red Hat are announcing a technical preview of new packages for the GPU drivers on RHEL. By integrating the drivers more tightly with RHEL, the new packages remove the need to have compilers and a full software development toolchain installed on each system running NVIDIA GPUs, and they simplify driver management. In this blog post, we provide an overview of the benefits of the new packages.

NVIDIA Driver Packages on RHEL

The new driver packaging yields three major benefits to users on RHEL.

1. It’s easy to stay on a tested RHEL / NVIDIA driver combination

A successful driver installation or upgrade on a non-DKMS driver branch (see below) always results in a combination of RHEL kernel and GPU driver versions that has been specifically tested. NVIDIA will provide driver packages for all enabled combinations of RHEL kernel and driver branch for their respective lifetimes. The current DKMS-based packages still support custom configurations as a fallback; that is, users will be able to install or upgrade to an untested combination of RHEL kernel and GPU driver.

In the technical preview, NVIDIA is making available Tesla drivers from the R418 and R430 driver branches for RHEL 7.6 on x86_64 (for a subset of all RHEL 7.6 kernels).

We will provide a support matrix of the driver branches enabled and tested on each RHEL version, along with their lifetimes; expect more communication from us on this in the future.

2. User control over driver branch selection from a single RPM repository

Users can now choose which of multiple branches of the NVIDIA GPU driver to follow from a single RPM repository. Some NVIDIA drivers are qualified for use on Tesla GPUs and may have extended lifetimes compared to other driver branches. Enterprise users may choose to stay on a specific driver branch for stability reasons, while other users may want to track other branches for access to new features.

Users can pick a specific driver branch (e.g., R430) that they want to track for updates, and will then only get updates from that branch. We also provide a virtual branch called “latest” that tracks the most recent Tesla driver at each point in time. The “latest” branch is the default and all other branches are opt-in. Branches can be switched without reinstalling the CUDA Toolkit.
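The branch-tracking behavior described above can be illustrated with a small Python model. This is only a sketch of the selection logic, not the packaging implementation: the repository contents, the version numbers, and the names `REPO`, `candidates`, and `next_update` are all hypothetical and chosen purely for illustration.

```python
from typing import Dict, List, Optional, Tuple

Version = Tuple[int, int]  # (major, minor), e.g. (430, 2)

# Hypothetical repository contents: branch name -> driver versions published on it.
# The version numbers are placeholders, not real driver releases.
REPO: Dict[str, List[Version]] = {
    "418": [(418, 1), (418, 2)],
    "430": [(430, 1), (430, 2)],
}


def candidates(branch: str, repo: Dict[str, List[Version]]) -> List[Version]:
    """Return the versions an update may choose from for the tracked branch."""
    if branch == "latest":
        # Assumption: the virtual "latest" branch sees every published driver,
        # so it always resolves to the most recent one.
        return [v for versions in repo.values() for v in versions]
    return list(repo[branch])


def next_update(branch: str, installed: Version,
                repo: Dict[str, List[Version]] = REPO) -> Optional[Version]:
    """Return the version an update would move to, or None if already current."""
    newer = [v for v in candidates(branch, repo) if v > installed]
    return max(newer) if newer else None


if __name__ == "__main__":
    print(next_update("418", (418, 1)))     # stays within the 418 branch
    print(next_update("418", (418, 2)))     # nothing newer on this branch
    print(next_update("latest", (418, 2)))  # moves to the newest driver overall
```

In this model, a system pinned to one branch never receives a driver from another branch, while a system on “latest” moves across branches as newer drivers appear.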

3. New alternative to DKMS

The default driver packages use a new alternative to DKMS. DKMS is a powerful tool that allows you to build any kernel module for any kernel. However, this flexibility comes at the cost of requiring a full compiler and software development toolchain installed on each system with an NVIDIA GPU, which introduces additional risk factors during the compilation step of installation. The new approach strips this down to what is minimally required and allows package maintainers to have full control of the toolchain being used (e.g., so that the compiler that is used to build the kernel is also used for the driver’s kernel modules). In particular, GCC does not need to be installed on each system with an NVIDIA GPU, nor does the EPEL repository need to be enabled.

The source files of the open-source parts of the driver are compiled in advance at package build time, which is why we call these “pre-compiled” drivers. The source code for the open-source parts of the driver remains available. As with DKMS, any binaries containing proprietary parts of the driver are still shipped separately and are linked with the other binaries only at installation time, on the user’s system.

To cover special use cases such as using the driver on a custom kernel, we still offer one opt-in branch that is a DKMS-based variant of the “latest” branch (called “latest-dkms”). Using the DKMS-based branch is not guaranteed to result in a kernel/driver combination that has been tested.
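The trade-off between the pre-compiled packages and the DKMS-based branch can be summarized in a short Python sketch. It is a simplified model of the rules stated in this post, not the installer's actual logic: the support matrix entries, kernel names, and the names `TESTED`, `InstallPlan`, and `plan_install` are hypothetical.

```python
from dataclasses import dataclass
from typing import FrozenSet, Tuple

# Hypothetical support matrix of tested (kernel, driver branch) pairs.
# Kernel names are placeholders, not real RHEL kernel versions.
TESTED: FrozenSet[Tuple[str, str]] = frozenset({
    ("kernel-A", "418"),
    ("kernel-A", "430"),
    ("kernel-B", "430"),
})


@dataclass(frozen=True)
class InstallPlan:
    allowed: bool            # can the driver be installed for this kernel?
    guaranteed_tested: bool  # is the resulting combination guaranteed tested?
    needs_compiler: bool     # is a compiler toolchain required on the system?


def plan_install(kernel: str, branch: str, use_dkms: bool = False,
                 tested: FrozenSet[Tuple[str, str]] = TESTED) -> InstallPlan:
    """Model the outcome of installing a driver branch on a given kernel."""
    if use_dkms:
        # DKMS builds the module locally for any kernel, so it needs a
        # toolchain and carries no guarantee that the pairing was tested.
        return InstallPlan(allowed=True, guaranteed_tested=False,
                           needs_compiler=True)
    # Pre-compiled packages exist only for tested pairs and need no compiler.
    ok = (kernel, branch) in tested
    return InstallPlan(allowed=ok, guaranteed_tested=ok, needs_compiler=False)


if __name__ == "__main__":
    print(plan_install("kernel-A", "430"))
    print(plan_install("custom-kernel", "430"))
    print(plan_install("custom-kernel", "430", use_dkms=True))
```

The model makes the design choice explicit: the pre-compiled path trades flexibility for a tested result and no local toolchain, while the DKMS path accepts any kernel at the cost of both.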

Summary

The new NVIDIA driver packages provide a better GPU driver installation and management experience to users on RHEL. To get started with the new packages, follow the instructions in the README. The packages currently support only RHEL 7.6, but we are working to quickly expand support to RHEL 8.

NVIDIA and Red Hat have increased the breadth of testing of the drivers on RHEL. We are also working on new features, such as containerized drivers for use in Kubernetes environments like Red Hat OpenShift, and on expanded support for platforms such as NVIDIA DGX and POWER-based systems.

Please give us feedback on the technical preview of the new package improvements. Register for the NVIDIA developer program to report issues or provide feedback on the NVIDIA Developer Forums.

About the Author

Pramod Ramarao is a product manager for accelerated computing at NVIDIA. He leads product management for the CUDA platform and data center software, including container technologies.
