Developer Blog: Speeding Up Deep Learning Inference Using TensorRT

NVIDIA TensorRT is an SDK for deep learning inference. TensorRT provides APIs and parsers to import trained models from all major deep learning frameworks. It then generates optimized runtime engines deployable in the datacenter as well as in automotive and embedded environments.

This post provides a simple introduction to using TensorRT. You learn how to deploy a deep learning application onto a GPU, increasing throughput and reducing latency during inference. It uses a C++ example to walk you through converting a PyTorch model into an ONNX model and importing it into TensorRT, applying optimizations, and generating a high-performance runtime engine for the datacenter environment.

TensorRT supports both C++ and Python; if you use either, this workflow discussion could be useful. If you prefer to use Python, see Using the Python API in the TensorRT documentation.

Deep learning applies to a wide range of applications such as natural language processing, recommender systems, image, and video analysis. As more applications use deep learning in production, demands on accuracy and performance have led to strong growth in model complexity and size.

Safety-critical applications such as automotive place strict requirements on throughput and latency expected from deep learning models. The same holds true for some consumer applications, including recommendation systems.

TensorRT is designed to help deploy deep learning for these use cases. With support for every major framework, TensorRT helps process large amounts of data with low latency through powerful optimizations, use of reduced precision, and efficient memory use.

To follow along with this post, you need a computer with a CUDA-capable GPU or a cloud instance with GPUs and an installation of TensorRT. On Linux, the easiest place to get started is by downloading the GPU-accelerated PyTorch container with TensorRT integration from the NVIDIA Container Registry (on NGC).

Read the full post, Speeding Up Deep Learning Inference Using TensorRT, on the NVIDIA Developer Blog