NVIDIA TensorRT is a high-performance deep learning inference optimizer and runtime that delivers low latency and high throughput. TensorRT can import trained models from major deep learning frameworks to create highly efficient inference engines that can be incorporated into larger applications and services.
Five Key Things from this video:
- TensorRT supports the RNNv2, MatrixMultiply, ElementWise, and TopK layers.
- For the RNNv2 layer, weights must be set separately for each gate and each layer. The input format for RNNv2 is BSE (Batch, Sequence, Embedding).
- A Fully Connected layer can also be implemented with a MatrixMultiply layer and an ElementWise layer. Alternatively, you can use TensorRT's Fully Connected layer directly, but the weights must be reshaped before they are fed to this layer.
- You can serialize the engine to a memory block, which you can then write to a file or stream. This eliminates the need to perform the optimization step again.
- Although this sample is built using C++, you can implement the same thing in Python using the TensorRT Python API.
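To make the Fully Connected point concrete, here is a minimal sketch in plain Python (no TensorRT calls) of why a MatrixMultiply followed by an ElementWise add gives the same result as a Fully Connected layer. It assumes the weights are stored as (output features x input features) and that the ElementWise step is a bias addition. The helper names and the sample values are made up for illustration, and the layout TensorRT actually expects may differ.

```python
def transpose(m):
    """Swap rows and columns of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    """Multiply an (m x k) matrix by a (k x n) matrix, both lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def elementwise_add(a, bias):
    """Add a bias vector to every row of a (broadcast over the batch)."""
    return [[v + b for v, b in zip(row, bias)] for row in a]


def fully_connected(x, weights, bias):
    """Reference FC layer: one dot product plus bias per output feature.

    Assumes weights are laid out as (out_features x in_features).
    """
    return [[sum(w * v for w, v in zip(w_row, x_row)) + b
             for w_row, b in zip(weights, bias)]
            for x_row in x]


# Hypothetical data: batch of 2 inputs, 3 input features, 2 output features.
x = [[1.0, 2.0, 3.0],
     [4.0, 5.0, 6.0]]
weights = [[1.0, 0.0, -1.0],
           [0.5, 0.5, 0.5]]
bias = [1.0, -2.0]

fc_out = fully_connected(x, weights, bias)

# Same computation as two steps: matrix multiply, then element-wise add.
# The transpose plays the role of the weight re-layout between the two forms.
mm_out = elementwise_add(matmul(x, transpose(weights)), bias)

print(fc_out)
print(mm_out)
```

Both print statements show the same values. The transpose in the second path is the kind of weight re-layout to watch for when switching between the two formulations.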
To follow along with this video and get started: