In September 2018, NVIDIA introduced NVIDIA TensorRT Inference Server, a production-ready solution for data center inference deployments. TensorRT Inference Server maximizes GPU utilization, supports all popular AI frameworks, and eliminates writing inference stacks from scratch. You can learn more about TensorRT Inference Server in this NVIDIA Developer blog post. Today we are announcing that NVIDIA TensorRT Inference Server is now an open source project.
NVIDIA is a dedicated supporter of the open source community, with over 120 repositories available from our GitHub page, over 800 contributions to deep learning projects by our deep learning frameworks team in 2017, and contributions of many large-scale projects such as RAPIDS, NVIDIA DIGITS, NCCL, and now, TensorRT Inference Server.
Open sourcing TensorRT Inference Server will let developers customize and integrate it into their data center inference workflows. Examples of how developers can extend TensorRT Inference Server include:
- Custom pre- and post-processing
Developers now have much more flexibility to handle pre- and post-processing, letting them customize TensorRT Inference Server for capabilities such as image augmentation, feature expansion, or video decoding. Integrating the processing directly into the inference server improves performance over handling those tasks separately.
- Additional framework backends
TensorRT Inference Server supports the top deep learning frameworks today, including TensorFlow, TensorRT, and Caffe2, with others supported via the ONNX path. Developers can now integrate additional frameworks of their choice directly into the inference server to further simplify model deployment in their environments.
To help developers with their efforts, the TensorRT Inference Server documentation includes detailed build and test instructions in addition to API reference documentation.
Improve Utilization with Dynamic Batching
NVIDIA will continue to develop TensorRT Inference Server hand-in-hand with the community to add new features and functionality. For example, the latest release includes a widely requested feature, dynamic batching.
Batching requests before they are sent for processing significantly reduces overhead and improves performance, but someone has to write the logic that does the batching. With the new dynamic batching feature, TensorRT Inference Server automatically combines separate requests into batches as they arrive. The user controls batch size and latency to tune performance for their specific needs. This eliminates the work of writing and deploying a batching algorithm in front of the inference server, which simplifies integration and deployment.
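To make the idea concrete, here is a minimal conceptual sketch of dynamic batching in Python. It is not TensorRT Inference Server code and does not use its API: the function `form_batches` and the parameters `max_batch_size` and `max_queue_delay` are hypothetical names chosen for illustration. The sketch assumes a simple policy in which a batch is dispatched when it is full, or when the oldest waiting request has been queued longer than the allowed delay.

```python
from typing import List, Tuple


def form_batches(
    arrivals: List[Tuple[float, str]],
    max_batch_size: int,
    max_queue_delay: float,
) -> List[List[str]]:
    """Group (arrival_time_seconds, request_id) pairs into batches.

    Illustrative policy (an assumption, not the server's actual algorithm):
    a batch closes when it reaches max_batch_size, or when a new request
    arrives after the oldest queued request has waited max_queue_delay.
    """
    batches: List[List[str]] = []
    current: List[str] = []
    deadline = 0.0  # latest time the current batch may keep waiting

    for arrival_time, request_id in sorted(arrivals, key=lambda a: a[0]):
        # The pending batch timed out before this request arrived.
        if current and arrival_time > deadline:
            batches.append(current)
            current = []
        # The first request in a new batch starts the delay clock.
        if not current:
            deadline = arrival_time + max_queue_delay
        current.append(request_id)
        # A full batch is dispatched immediately.
        if len(current) == max_batch_size:
            batches.append(current)
            current = []

    if current:
        batches.append(current)
    return batches


if __name__ == "__main__":
    requests = [(0.000, "a"), (0.001, "b"), (0.002, "c"), (0.010, "d"), (0.011, "e")]
    print(form_batches(requests, max_batch_size=2, max_queue_delay=0.005))
```

Raising the batch size limit in this sketch tends to produce larger batches and better throughput, while lowering the delay limit bounds how long any single request waits. That is the same throughput-versus-latency trade-off the feature exposes to the user.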
An open source TensorRT Inference Server allows the community to help shape the direction of the product and lets users build solutions specific to their use cases immediately, while helping others with similar needs. We invite you to contribute to this new project on GitHub and give us feedback.
To learn how to get started, read the new NVIDIA Developer Blog post, “How to Speed Up Deep Learning Inference Using TensorRT”.
Download TensorRT Inference Server source from GitHub, or get the compiled solution in a ready-to-deploy container from the NGC container registry with monthly updates.