The GA (general availability) release of the NVIDIA TensorRT inference server is now available for download as a container from the NVIDIA GPU Cloud container registry.
Announced at GTC Japan and part of the NVIDIA TensorRT Hyperscale Inference Platform, the TensorRT inference server is a containerized microservice for data center production deployments.
As more and more applications leverage AI, it has become vital to provide inference capabilities in production environments. Just as an application might call a web server to retrieve HTML content, modern applications need to access inference the same way: via a simple API call. But existing solutions are often custom-built for a specific application rather than for general-purpose production use, and they aren’t optimized to get the most out of GPUs, which limits their usefulness.
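To make the "inference as an API call" idea concrete, the sketch below packages input tensor values into a JSON request body that an application could POST to an inference server over HTTP. The field names (`model`, `inputs`, `shape`, `data`) and the model and tensor names are illustrative assumptions, not the TensorRT inference server's documented wire format; consult the server's API documentation for the real schema.

```python
import json

def build_inference_request(model_name, input_name, values):
    """Package input tensor values as a JSON request body.

    Note: this structure is a hypothetical sketch of an inference API
    payload, not the TensorRT inference server's actual protocol.
    """
    return json.dumps({
        "model": model_name,
        "inputs": [
            {"name": input_name, "shape": [len(values)], "data": values}
        ],
    })

# The application would POST this body to the server's inference
# endpoint (for example with urllib.request) and read predictions
# back from the JSON response.
body = build_inference_request("resnet50", "input__0", [0.1, 0.2, 0.3])
```

From the application's point of view, the heavy lifting (model loading, batching, GPU scheduling) happens behind that one HTTP call, which is exactly the web-server analogy above.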
The TensorRT inference server provides production-quality inference capabilities in a ready-to-run container. It maximizes utilization by supporting multiple models per GPU, so every GPU can service any incoming request, eliminating the bottleneck of previous solutions that could serve only a single model per GPU. It supports all popular AI frameworks, so data scientists can develop their models in the best framework for the job. And the TensorRT inference server integrates seamlessly into DevOps deployments that leverage Docker and Kubernetes.
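A deployment with Docker might look like the following sketch. The container tag, port numbers, model-store path, and command-line flag are assumptions based on typical NGC container usage at the time of this release, so check the container's release notes for the exact invocation.

```shell
# Pull the inference server container from the NVIDIA GPU Cloud
# registry (the tag shown here is an assumed example).
docker pull nvcr.io/nvidia/tensorrtserver:18.09-py3

# Launch the server, exposing its HTTP and gRPC ports and mounting a
# local directory of trained models into the container. The flag name
# and ports are illustrative; consult the server documentation.
nvidia-docker run --rm -p 8000:8000 -p 8001:8001 \
  -v /path/to/model/store:/models \
  nvcr.io/nvidia/tensorrtserver:18.09-py3 \
  trtserver --model-store=/models
```

Because the server is just another container, the same image drops into a Kubernetes Deployment for scaling and rolling updates, which is what makes the DevOps integration mentioned above straightforward.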
With the NVIDIA TensorRT inference server, there is now a common solution for AI inference. Researchers can focus on creating high-quality trained models, DevOps engineers can focus on deployment, and developers can focus on their applications, without anyone needing to reinvent the AI plumbing over and over again.
Download the TensorRT inference server from the NVIDIA GPU Cloud container registry now.
Learn how to use the TensorRT inference server in this NVIDIA Developer Blog post.