Developer Blog: Accelerating WinML and NVIDIA Tensor Cores

Every year, clever researchers introduce ever more complex and interesting deep learning models to the world. There is of course a big difference between a model that works as a nice demo in isolation and a model that performs a function within a production pipeline.

This is particularly pertinent to creative apps, where generative models must run with low latency to generate or enhance image- or video-based content.

In many situations, to reduce latency and provide the best interaction, you want to perform inference on a local workstation GPU rather than in the cloud.

There are several constraints to consider when deploying to the workstation:

  • Hardware
    • The target hardware is unknown when you build the model.
    • The hardware may change after installation. A user may have a GTX 1060 one day and an RTX 6000 the next.
  • Resources
    • Resources are far more predictable when a model is deployed in the cloud than when it is deployed on a workstation.

The overriding advantage of workstation execution is that it removes the extra latency of a round trip to a remote service whose response time may not be guaranteed.

NVIDIA Tensor Cores

On NVIDIA GPUs from the Volta architecture onward, including RTX hardware, Tensor Cores accelerate some of the heavy-lifting operations involved in deep learning. Essentially, the Tensor Cores enable an operation called warp matrix multiply-accumulate (wmma), providing optimized paths for FP16-based (hmma) and integer-based (imma) matrix multiplication.
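To make the multiply-accumulate idea concrete, here is a minimal CPU sketch of the arithmetic, D = A x B + C, on one small tile. It is not the Tensor Core API and does not run on the GPU. The helper names to_fp16 and mma_tile are hypothetical, the 2x2 tile is chosen only for readability, and keeping the accumulator at higher precision than the FP16 inputs is an assumption of this sketch.

```python
import struct


def to_fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    # The struct format "e" is binary16, so a pack/unpack round trip rounds x.
    return struct.unpack("<e", struct.pack("<e", x))[0]


def mma_tile(a, b, c):
    """Emulate D = A x B + C on one tile, with A and B rounded to FP16.

    Assumption for this sketch: the products are accumulated at higher
    precision (a Python float) on top of the accumulator tile C.
    """
    m, k, n = len(a), len(b), len(b[0])
    a16 = [[to_fp16(v) for v in row] for row in a]
    b16 = [[to_fp16(v) for v in row] for row in b]
    return [
        [c[i][j] + sum(a16[i][p] * b16[p][j] for p in range(k)) for j in range(n)]
        for i in range(m)
    ]


a = [[1.0, 2.0], [3.0, 4.0]]
b = [[1.0, 0.0], [0.0, 1.0]]   # identity, so A x B equals A
c = [[0.5, 0.5], [0.5, 0.5]]   # accumulator tile
d = mma_tile(a, b, c)
print(d)  # [[1.5, 2.5], [3.5, 4.5]]
```

Rounding the inputs matters in practice: to_fp16(0.1) is not exactly 0.1, so a model that is sensitive to small weight differences should be checked at FP16 before relying on this path.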

To take full advantage of the hardware acceleration, it’s important to understand the exact capabilities of the Tensor Cores.

Convolutional neural networks contain many convolution layers that, when you examine the core operation, come down to many dot products. These dot products can be batched together to run as a single large matrix multiplication.
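The batching described above can be sketched in plain Python. This is one common way to do it (often called im2col), shown here as an illustration and not as what WinML or the driver does internally. The function names are hypothetical, and the sketch assumes a single-channel image, stride 1, and no padding. As in most deep learning frameworks, the kernel is not flipped.

```python
def im2col(image, kh, kw):
    """Unroll every kh x kw patch of a 2D image into one row of a matrix."""
    h, w = len(image), len(image[0])
    rows = []
    for y in range(h - kh + 1):
        for x in range(w - kw + 1):
            rows.append([image[y + dy][x + dx] for dy in range(kh) for dx in range(kw)])
    return rows


def matmul(a, b):
    """Plain matrix multiplication: (m x k) times (k x n) gives (m x n)."""
    return [
        [sum(a[i][p] * b[p][j] for p in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def conv2d_direct(image, kernel):
    """Reference version: one separate dot product per output pixel."""
    kh, kw = len(kernel), len(kernel[0])
    h, w = len(image), len(image[0])
    return [
        [
            sum(image[y + dy][x + dx] * kernel[dy][dx] for dy in range(kh) for dx in range(kw))
            for x in range(w - kw + 1)
        ]
        for y in range(h - kh + 1)
    ]


def conv2d_as_matmul(image, kernels):
    """Apply several same-sized kernels with a single matrix multiplication."""
    kh, kw = len(kernels[0]), len(kernels[0][0])
    patches = im2col(image, kh, kw)               # (num_pixels x kh*kw)
    flat = [[k[dy][dx] for dy in range(kh) for dx in range(kw)] for k in kernels]
    weights = [list(col) for col in zip(*flat)]   # (kh*kw x num_kernels)
    return matmul(patches, weights)               # (num_pixels x num_kernels)


image = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
kernels = [[[1, 0], [0, -1]], [[0, 1], [1, 0]]]
out = conv2d_as_matmul(image, kernels)
print(out[0])  # [-5, 7]
```

Each row of out holds one output pixel and each column holds one kernel, so column j matches conv2d_direct(image, kernels[j]) read row by row. The trade-off is extra memory for the unrolled patches in exchange for a single large multiplication, which is the shape of work that matrix hardware handles well.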

Read the full blog, Accelerating WinML and NVIDIA Tensor Cores, on the NVIDIA Developer Blog.