Video Tutorial: Accelerating Inference Performance of Recommendation Systems with TensorRT

NVIDIA TensorRT is a high-performance deep learning inference optimizer and runtime that delivers low latency and high throughput for deep learning inference applications. You can import trained models from all major deep learning frameworks into TensorRT and create highly efficient inference engines that can be incorporated into larger applications and services.
This video demonstrates the steps for using NVIDIA TensorRT to optimize a Multilayer Perceptron-based recommender system that is trained on the MovieLens dataset.

Five key things from this video:

Importing a trained TensorFlow model into TensorRT is straightforward with the Universal Framework Format (UFF) toolkit, which is included in TensorRT.
You can add an extra layer to the trained model even after importing it into TensorRT.
You can serialize the engine to a memory block, which you can then write to a file or stream. Loading the serialized engine later eliminates the need to perform the optimization step again.
Although the model is trained at higher precision (FP32), TensorRT gives you the flexibility to run inference at lower precision (FP16).
TensorRT 4 includes new operations such as Concat, Constant, and TopK, plus optimizations for Multilayer Perceptrons to speed up inference performance of recommendation systems.
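
To make the Concat, reduced-precision, and TopK points above more concrete, here is a minimal NumPy sketch of what such a recommender computes at inference time. It is not the TensorRT API and not the code from the video: the weights are random rather than trained on MovieLens, the layer sizes are arbitrary, and the names mlp_scores and top_k are hypothetical helpers introduced only for illustration.

```python
import numpy as np


def mlp_scores(user_vec, item_vecs, weights, biases, dtype=np.float32):
    """Score every candidate item for one user with a small MLP.

    Hypothetical helper: the layers run in `dtype` (FP32 or FP16) so the
    effect of lower-precision inference can be compared.
    """
    n_items = item_vecs.shape[0]
    # Concat step: pair the user embedding with each item embedding.
    user_tiled = np.broadcast_to(user_vec, (n_items, user_vec.shape[0]))
    x = np.concatenate([user_tiled, item_vecs], axis=1).astype(dtype)
    # Hidden layers: fully connected + ReLU in the requested precision.
    for w, b in zip(weights[:-1], biases[:-1]):
        x = np.maximum(x @ w.astype(dtype) + b.astype(dtype), 0)
    # Output layer: one logit per item.
    logits = x @ weights[-1].astype(dtype) + biases[-1].astype(dtype)
    # Sigmoid evaluated in FP32 so both precisions return comparable scores.
    return 1.0 / (1.0 + np.exp(-logits[:, 0].astype(np.float32)))


def top_k(scores, k):
    """Return indices and values of the k highest scores, best first."""
    idx = np.argsort(-scores, kind="stable")[:k]
    return idx, scores[idx]


# Assumed toy dimensions and random weights (not a trained model).
rng = np.random.default_rng(0)
EMBED, HIDDEN, N_ITEMS = 8, 16, 100
user_vec = rng.normal(size=EMBED)
item_vecs = rng.normal(size=(N_ITEMS, EMBED))
weights = [
    rng.normal(scale=0.3, size=(2 * EMBED, HIDDEN)),
    rng.normal(scale=0.3, size=(HIDDEN, 1)),
]
biases = [np.zeros(HIDDEN), np.zeros(1)]

scores_fp32 = mlp_scores(user_vec, item_vecs, weights, biases, np.float32)
scores_fp16 = mlp_scores(user_vec, item_vecs, weights, biases, np.float16)

# Largest per-item score difference introduced by running the layers in FP16.
max_diff = float(np.max(np.abs(scores_fp32 - scores_fp16)))

# TopK step: keep only the five best-scoring items for this user.
top_idx, top_scores = top_k(scores_fp32, 5)
print("max FP32/FP16 score difference:", max_diff)
print("top-5 item indices:", top_idx)
```

In this sketch the TopK selection runs on the CPU after scoring. Having TopK available as a TensorRT operation means that selection can live inside the optimized engine instead of being a separate post-processing step.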

Below is more information related to the video tutorial:
Code used in the video: sampleMovieLens
Jupyter Notebook used in the video: sampleMLP-notebook
Learn more about TensorRT: developer.nvidia.com/tensorrt