NVSwitch: Leveraging NVLink to Maximum Effect

GPUs have been PCIe devices for many generations in client systems, and more recently in servers. The rapid growth in deep learning workloads has driven the need for a faster and more scalable interconnect, as PCIe bandwidth increasingly becomes the bottleneck at the multi-GPU system level. As deep learning neural networks become more sophisticated, their size and complexity will continue to expand, as will the data sets they ingest to deliver meaningful insights. This expansion will in turn drive demand for compute, as well as memory capacity to further accelerate deep learning innovations.

In addition to general growth of workloads, there are a number of emerging techniques that will also require more GPUs that can quickly communicate with one another. Advanced approaches such as model parallel training, which involves training different parts of the AI model in parallel, have been developed but require large memory and extremely fast connections between GPUs to work. A critical operation of deep learning training is for GPUs to exchange updated model weights, biases and gradients so that all GPUs working on a network remain synchronized. This operation necessitates significant memory traffic to complete these exchanges. In HPC, large FFTs are a key workload for seismic imaging, defense or life sciences, a single NVIDIA DGX-2 system can be up to 2.4 faster than two DGX-1 system. NVIDIA’s new DGX-2 brings a 16-GPU system to market, and NVIDIA’s new NVSwitch technology is the backbone of this system.

NVSwitch Die Shot

NVSwitch is implemented on a baseboard as six chips, each of which is an 18-port, NVLink switch with an 18×18-port fully-connected crossbar. Each baseboard has six NVSwitch chips on it, and can communicate with another baseboard to enable 16-GPUs in a single server node.

Read the NVSwitch technical overview for more details. 

Each NVSwitch chip has the following:

Vital Statistics:

Port Configuration 18 NVLINK ports
Speed per Port 50 GByte/s per NVLINK port
(total for both directions)
Connectivity Fully-connected crossbar internally
Transistor Count 2 billion

Each of the eight GPUs on one baseboard are connected using a single NVLink to each of six NVSwitch chips on the baseboard. Eight ports on each of the NVSwitch chips are used to communicate with the other baseboard. Each of eight GPUs on a baseboard can communicate with any of the other GPUs on the baseboard at full bandwidth of 300 GB/sec with a single NVSwitch traversal.

NVSwitch-enabled 16-GPU system is delivering 2.4x more performance on the IFS Weather Simulator mini-app, and 2.7x on deep learning training. (See notes for workload details and system configs).

With the continuing explosive growth of neural networks’ size, complexity and designs, it’s difficult to predict the exact form those networks will take, but one thing remains certain: the appetite for deep learning compute will continue to grow along with it. In the HPC domain, workloads like weather modeling using large-scale, FFT-based computations will also continue to drive demand for compute horsepower.  And with a 16-GPU configuration with 2 petaFLOPS of compute horsepower and a half-terabyte of GPU memory in a unified address space, applications can scale up without requiring knowledge of the underlying physical topology.

Learn more >

Workload Notes and System Configuration Information:
System Configs: Each of the two DGX-1 servers have dual-socket Xeon E5 2690v4 Processor, 8 x V100 GPUs; servers connected via a 4 EDR (100Gb) InfiniBand connections. DGX-2 server has dual-socket Xeon Scalable Processor Platinum 8168 Processors, 16 x Tesla V100 GPUs
ECWMF’s IFS: The Integrated Forecasting System (IFS) is a global numerical weather prediction model developed by the European Centre for Medium-Range Weather Forecasts (ECMWF) based in Reading, United Kingdom. ECMWF is an independent intergovernmental organization supported by most of the nations of Europe, and operates one of the largest supercomputer centers in Europe for frequent updates of global weather forecasts. The IFS mini-app benchmark focuses its work on a spherical harmonics transformation that represents a significant computational load of the full model. The benchmark speedups shown here are better than those for the full IFS model, since the benchmark amplifies the transform stages of the algorithm (by design). However, this benchmark demonstrates that ECMWF’s extremely effective and proven methods for providing world-leading predictions remain valid on NVSwitch-equipped servers such as NVIDIA’s DGX-2, since they are such a good match to the problem.
Mixture of Experts (MoE): Based on a network published by Google at the Tensor2Tensor GitHub, using the Transformer model with MoE layers. The MoE layers each consist of 128 experts, each of which is a smaller feed-forward deep neural network (DNN). Each expert specializes in a different domain of knowledge, and the experts are distributed to different GPUs, creating significant all-to-all traffic due to communications between the Transformer network layers and the MoE layers. The training dataset used is the “1 billion word benchmark for language modeling” according to Google. Training operations use Volta Tensor Core and runs for 45,000 steps to reach perplexity equal to 34. This workload uses a batch size of 8,192 per GPU.