PyTorch 1.0 Accelerated On NVIDIA GPUs

Facebook announced the availability of the PyTorch 1.0 preview release today at the PyTorch Developer Conference, an event for the PyTorch developer community.
PyTorch is one of the most widely used deep learning frameworks among researchers and developers. PyTorch 1.0, announced by Facebook earlier this year, is a deep learning framework that powers numerous products and services at scale by merging the best of both worlds: the distributed and native performance found in Caffe2 and the flexibility for rapid development found in the existing PyTorch framework. At a high level, PyTorch is a Python package that provides high-level features such as tensor computation with strong GPU acceleration. The preview release of PyTorch 1.0 provides an initial set of tools that let developers migrate easily from research to production.

NVIDIA and Facebook both strive to bring innovation and flexibility to the deep learning developer community. In 2017, NVIDIA and Facebook jointly announced a collaboration that enabled developers and researchers to create large-scale distributed training scenarios for building machine learning-based applications for edge devices. We are investing further in our engineering efforts and working together to empower and engage the PyTorch developer community.
Here are some of the highlighted collaborations:

A PyTorch Extension (APEX) is a set of tools for easy mixed precision and distributed training in PyTorch.
For Mixed Precision: there are tools for AMP (Automatic Mixed Precision) and FP16_Optimizer. apex.amp is a tool designed for ease of use and maximum safety in FP16 training, and it automatically implements dynamic loss scaling. FP16_Optimizer is intended to achieve most of the numerical stability of full FP32 training and almost all the performance benefits of full FP16 training. apex.FP16_Optimizer wraps an existing PyTorch optimizer and automatically implements master parameters and static or dynamic loss scaling under the hood.
For Distributed Training: apex.parallel.DistributedDataParallel is a module wrapper that enables convenient multiprocess distributed training, optimized for the NVIDIA Collective Communications Library (NCCL).
The NVIDIA TensorRT platform supports the PyTorch framework across the inference workflow. With the TensorRT optimizer and runtime engine, you can import PyTorch models through the ONNX format, apply INT8 and FP16 optimizations, calibrate for lower precision with high accuracy, and generate runtimes for production deployment. TensorRT-optimized applications run up to 40x faster than on CPU-only platforms.
The NVIDIA TensorRT inference server is a containerized inference microservice that maximizes GPU utilization in data centers. PyTorch models can be used with the TensorRT inference server through the ONNX format, Caffe2’s NetDef format, or as TensorRT runtime engines. Through its Docker and Kubernetes integration, the TensorRT inference server fits seamlessly into DevOps deployments, so developers can focus on their applications without reinventing the plumbing for each AI-powered application.
The PyTorch container available from the NVIDIA GPU Cloud (NGC) container registry provides a simple way for users to get started with PyTorch. Each month, NVIDIA takes the latest version of PyTorch along with the latest NVIDIA drivers and runtimes, then tunes and optimizes across the stack for maximum performance on NVIDIA GPUs. NVIDIA makes its contributions available to the upstream community and also packages them into a container that users can download at no charge.
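To make the mixed precision terms above more concrete, here is a minimal, stdlib-only Python sketch of two ideas that APEX automates: high-precision master weights and dynamic loss scaling. It is not apex's API. Every name below (`to_fp16`, `train_fp16_only`, `train_with_master`, `DynamicLossScaler`, `scaled_backward_step`) is hypothetical, FP16 rounding is emulated with the `struct` module's `"e"` (half-precision) format, Python's double-precision floats stand in for FP32 master weights, and the initial scale of 2^16 and growth window of 1,000 steps are illustrative assumptions.

```python
import math
import struct
from typing import List, Tuple


def to_fp16(x: float) -> float:
    """Round x to the nearest IEEE 754 half-precision value (emulated FP16)."""
    try:
        # struct's "e" format is binary16; unpacking returns the rounded value.
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        # Values beyond the FP16 range (max 65504) overflow to infinity.
        return math.copysign(math.inf, x)


def train_fp16_only(weight: float, update: float, steps: int) -> float:
    """Apply `steps` updates while storing the weight only in FP16."""
    w = to_fp16(weight)
    for _ in range(steps):
        w = to_fp16(w - to_fp16(update))  # small updates can round away entirely
    return w


def train_with_master(weight: float, update: float, steps: int) -> Tuple[float, float]:
    """Apply updates to a high-precision master copy; return (master, FP16 model copy)."""
    master = weight  # Python float (FP64) stands in for the FP32 master weight
    for _ in range(steps):
        master -= to_fp16(update)  # gradient arrives in FP16, update applied in high precision
    return master, to_fp16(master)  # FP16 copy used for the next forward pass


class DynamicLossScaler:
    """Hypothetical minimal dynamic loss scaler (illustrative, not apex's class)."""

    def __init__(self, init_scale: float = 2.0 ** 16, factor: float = 2.0, window: int = 1000):
        self.scale = init_scale
        self.factor = factor
        self.window = window
        self.good_steps = 0

    def update(self, scaled_grads: List[float]) -> bool:
        """Return True if the step may be applied, adjusting the scale either way."""
        if any(math.isinf(g) or math.isnan(g) for g in scaled_grads):
            self.scale /= self.factor  # overflow: shrink the scale and skip this step
            self.good_steps = 0
            return False
        self.good_steps += 1
        if self.good_steps % self.window == 0:
            self.scale *= self.factor  # a run of clean steps: try a larger scale
        return True


def scaled_backward_step(true_grad: float, scaler: DynamicLossScaler) -> float:
    """Toy one-parameter step: retry until the FP16 scaled gradient is finite."""
    while True:
        # Stands in for calling backward() on (loss * scale) in FP16.
        scaled = to_fp16(true_grad * scaler.scale)
        if scaler.update([scaled]):
            return scaled / scaler.scale  # unscale before the optimizer step


if __name__ == "__main__":
    print(train_fp16_only(1.0, 1e-4, 1000))    # the FP16-only weight never moves
    print(train_with_master(1.0, 1e-4, 1000))  # the master weight drifts to about 0.9
    # A tiny gradient underflows to zero in FP16 unless it is scaled up first.
    print(to_fp16(1e-8), to_fp16(1e-8 * 2.0 ** 14))
    scaler = DynamicLossScaler()
    print(scaled_backward_step(3.0, scaler), scaler.scale)
```

The first two functions show why master weights matter: an update of 1e-4 is smaller than half the FP16 spacing just below 1.0, so a weight kept only in FP16 never changes, while the high-precision master copy accumulates all 1,000 updates. The scaler halves its scale whenever the scaled gradient overflows and doubles it after a window of clean steps. A real training loop would skip the overflowing step and move on to the next batch rather than retry the same gradient as this toy does.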

Research Projects

Notable research projects from the growing list of PyTorch community collaborations that use NVIDIA’s powerful GPUs and deep learning software stack for AI include:

Detectron is Facebook AI Research’s (FAIR) software platform for object detection research. Written in Python and powered by the Caffe2 deep learning framework, it implements state-of-the-art object detection algorithms such as Mask R-CNN and RetinaNet. All baselines in the Model Zoo were run on Big Basin servers with 8 NVIDIA Tesla P100 GPU accelerators.
High-resolution (2048 x 1024) photorealistic video-to-video translation based on conditional GANs, developed by MIT using NVIDIA GPUs and the NVIDIA software stack for deep learning.
Unsupervised language modeling at scale for robust sentiment classification: using reduced-precision FP16 arithmetic and the Tensor Core architecture, the model trains in less than 1 day on 8 Volta-class GPUs, down from a training time of 1 month (details on GitHub).
Image inpainting for irregular holes using partial convolutions, producing semantically meaningful predictions that fit smoothly with the rest of the image without any additional blending or post-processing.
Spatially displaced convolution (SDC) for video frame prediction, which handles large motion and allows the model to predict crisp future frames whose motion closely matches that of the ground-truth sequences; the model was trained on NVIDIA V100 Tensor Core GPUs.
The Training Deep AutoEncoders for Collaborative Filtering project uses PyTorch and introduces a new algorithm that can significantly speed up training and improve model performance.
Scaling Neural Machine Translation using Tensor Cores, which enable efficient half-precision (FP16) floating-point computation on NVIDIA Volta GPUs.
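The partial convolution idea behind the inpainting project above can be sketched in a few lines. For each window, only pixels marked valid by a binary mask contribute; the weighted sum is rescaled by the window size divided by the number of valid pixels, and the output location is marked valid if at least one input pixel was valid. This is a single-channel, stride-1, unpadded pure-Python sketch under those assumptions; the function name `partial_conv2d` and its list-of-lists layout are hypothetical and are not the project's own code.

```python
from typing import List, Tuple

Grid = List[List[float]]


def partial_conv2d(image: Grid, mask: Grid, kernel: Grid, bias: float = 0.0) -> Tuple[Grid, Grid]:
    """Single-channel partial convolution (stride 1, no padding).

    mask holds 1.0 for valid pixels and 0.0 for holes. Returns the output
    feature map and the updated mask.
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h, out_w = len(image) - kh + 1, len(image[0]) - kw + 1
    window_size = kh * kw  # sum(1) over the window
    out = [[0.0] * out_w for _ in range(out_h)]
    new_mask = [[0.0] * out_w for _ in range(out_h)]
    for i in range(out_h):
        for j in range(out_w):
            acc = 0.0
            valid = 0.0
            for di in range(kh):
                for dj in range(kw):
                    m = mask[i + di][j + dj]
                    acc += kernel[di][dj] * image[i + di][j + dj] * m  # W^T (X * M)
                    valid += m  # sum(M)
            if valid > 0:
                # Rescale so the result does not depend on how many pixels were valid.
                out[i][j] = acc * (window_size / valid) + bias
                new_mask[i][j] = 1.0  # at least one valid input: location is now filled
            # Otherwise the output and the updated mask both stay 0.
    return out, new_mask


if __name__ == "__main__":
    image = [[2.0, 2.0, 2.0],
             [2.0, 9.0, 2.0],  # 9.0 is garbage inside the hole
             [2.0, 2.0, 2.0]]
    mask = [[1.0, 1.0, 1.0],
            [1.0, 0.0, 1.0],
            [1.0, 1.0, 1.0]]
    box = [[0.25, 0.25], [0.25, 0.25]]  # 2x2 averaging kernel
    out, new_mask = partial_conv2d(image, mask, box)
    print(out)       # each window averages only its valid pixels, so values are about 2.0
    print(new_mask)  # every output location had at least one valid input
```

Because the hole pixel is masked out and the remaining sum is rescaled, the averaging kernel still reports the surrounding value rather than being pulled toward the garbage inside the hole. Stacking such layers lets the updated mask shrink the hole layer by layer.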

PyTorch is also widely used in high-performance computing (HPC). Here are some notable projects in this space:

OpenChem is a project sponsored by the University of North Carolina at Chapel Hill and NVIDIA. OpenChem makes deep learning models an easy-to-use tool for computational chemistry and drug design researchers. It is a deep learning toolkit for computational chemistry with a PyTorch backend optimized for NVIDIA GPUs, and its multi-GPU support enables faster training.
Deep learning has been successfully applied to many research areas in drug discovery. This study implements PyTorch-based variational generative autoencoders to map molecular structures.
Time-lagged autoencoders are a special type of deep neural network, implemented here in PyTorch, for learning slow collective variables for molecular kinetics.
This study shows how to learn fluid parameters from data, perform liquid control tasks, and learn policies to manipulate liquids using SPNets (Smooth Particle Networks), a framework for integrating fluid dynamics with deep networks. The forward and backward functions for the DNN layers were implemented in PyTorch and run on NVIDIA TITAN Xp GPUs.
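For readers new to the variational autoencoders used in the drug discovery study above, the two pieces that set a VAE apart from a plain autoencoder are the reparameterization trick and the KL-divergence term of the loss. The following is a minimal pure-Python sketch under stated assumptions: a diagonal Gaussian latent space with a standard normal prior, an encoder (not shown) that has already mapped a molecule to `mu` and `log_var`, and a reconstruction error supplied by the caller. The function names are hypothetical and do not come from the study's code.

```python
import math
import random
from typing import List, Sequence


def reparameterize(mu: Sequence[float], log_var: Sequence[float], rng: random.Random) -> List[float]:
    """Sample z = mu + sigma * eps, eps ~ N(0, 1), so gradients can flow through mu and sigma."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, log_var)]


def kl_to_standard_normal(mu: Sequence[float], log_var: Sequence[float]) -> float:
    """Closed-form KL(N(mu, diag(sigma^2)) || N(0, I)), summed over latent dimensions."""
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv for m, lv in zip(mu, log_var))


def vae_loss(reconstruction_error: float, mu: Sequence[float], log_var: Sequence[float]) -> float:
    """Negative ELBO when reconstruction_error is a negative log-likelihood: reconstruction + KL."""
    return reconstruction_error + kl_to_standard_normal(mu, log_var)


if __name__ == "__main__":
    rng = random.Random(0)
    mu, log_var = [0.5, -1.0], [0.0, -2.0]  # hypothetical encoder outputs for one molecule
    z = reparameterize(mu, log_var, rng)    # latent point a decoder would map back to a structure
    print(z)
    print(vae_loss(1.25, mu, log_var))
```

The KL term is zero only when `mu` is 0 and `log_var` is 0 in every dimension, which is how the regularizer pulls molecule encodings toward a smooth, sampleable latent space.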

We are enabling more of the experiences that matter most to our developer community. To get news in your inbox on how NVIDIA, Facebook, and the larger PyTorch ecosystem are enabling the next generation of powerful AI use cases, along with the latest NVIDIA product updates, subscribe to our developer newsletter. If you are a developer or researcher looking for technical resources on deep learning, check out the NVIDIA Developer Program to get started today!