Speeding Up Deep Learning Training with NVIDIA V100 Tensor Core GPUs in the AWS Cloud

Training deep learning models on NVIDIA GPUs is the gold standard in artificial intelligence, but the process can still take weeks to complete. To speed up that work, a team at Amazon Web Services today announced a scalable way to tune AWS infrastructure that cuts deep learning training times on GPUs from weeks to days.

“We demonstrate how to optimize AWS infrastructure to minimize deep learning training times by using distributed/multi-node synchronous training,” the Amazon team wrote in a blog post. “We use ResNet-50 with the ImageNet dataset and AWS EC2 P3 instances with NVIDIA Tesla V100 Tensor Core GPUs to benchmark our training times.”
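To make "synchronous training" concrete, here is a minimal sketch in plain Python. It is not the AWS team's code, which ran on MXNet and on TensorFlow with Horovod; the toy least-squares problem and the helper names (`gradient`, `allreduce_mean`, `sync_sgd_step`, `train`) are assumptions made for illustration. Each simulated worker computes a gradient on its own shard of the batch, the gradients are averaged in a single step that stands in for an allreduce, and every worker applies the same update.

```python
# Simulated synchronous data-parallel SGD on a 1-D least-squares problem.
# Illustrative only: "workers" are shards of one batch, not real GPUs.

def gradient(w, shard):
    """Mean gradient of 0.5 * (w*x - y)**2 over one shard of (x, y) pairs."""
    return sum((w * x - y) * x for x, y in shard) / len(shard)

def allreduce_mean(values):
    """Stand-in for an allreduce: every worker receives the same average."""
    return sum(values) / len(values)

def sync_sgd_step(w, shards, lr):
    """One synchronous step: local gradients, one averaging, identical update."""
    local_grads = [gradient(w, shard) for shard in shards]
    return w - lr * allreduce_mean(local_grads)

def train(data, num_workers, lr=0.05, steps=200):
    # Split the batch into equal-sized shards, one per simulated worker.
    size = len(data) // num_workers
    shards = [data[i * size:(i + 1) * size] for i in range(num_workers)]
    w = 0.0
    for _ in range(steps):
        w = sync_sgd_step(w, shards, lr)
    return w

# Toy data generated from y = 3x, so the ideal weight is 3.0.
data = [(float(x), 3.0 * x) for x in range(1, 9)]
w_single = train(data, num_workers=1)
w_multi = train(data, num_workers=4)
print(w_single, w_multi)
```

Because the shards are equal in size, the average of the per-worker gradients equals the full-batch gradient, so the four-worker run follows the same trajectory as the single-worker run. In a real cluster the averaging step is where network bandwidth between instances starts to matter.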

The team trained ResNet-50 in about 50 minutes on eight P3.16xlarge instances (64 V100 GPUs), using both the cuDNN-accelerated MXNet and TensorFlow deep learning frameworks.

Developers working with convolutional neural networks, recurrent neural networks, and generative adversarial networks can all take advantage of the improved performance.

Framework            | Time to train | Training throughput | Achieved top-1 validation accuracy | Scaling efficiency
Apache MXNet         | 47 min        | ~44,000 images/sec  | 75.75%                             | 92%
TensorFlow + Horovod | 50 min        | ~41,000 images/sec  | 75.54%                             | 90%
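One way to read the scaling-efficiency column is as measured throughput divided by the throughput that perfect linear scaling would deliver. That definition is an assumption here, since the article does not spell out the baseline the team used. Under it, the short Python sketch below backs out the ideal 64-GPU throughput and the per-GPU rate implied by the reported figures. The function names are hypothetical, and the derived numbers are arithmetic on the table, not measurements.

```python
# Back out the linear-scaling baseline implied by the reported figures.
# Assumed definition (not confirmed by the source):
#   efficiency = measured_throughput / (num_gpus * per_gpu_baseline)

NUM_GPUS = 64  # eight P3.16xlarge instances with eight V100 GPUs each

def implied_linear_throughput(measured, efficiency):
    """Throughput that perfect linear scaling would give, in images/sec."""
    return measured / efficiency

def scaling_efficiency(measured, per_gpu_baseline, num_gpus):
    """Fraction of ideal linear throughput actually achieved."""
    return measured / (per_gpu_baseline * num_gpus)

# (approximate throughput in images/sec, reported scaling efficiency)
reported = {
    "Apache MXNet": (44_000, 0.92),
    "TensorFlow + Horovod": (41_000, 0.90),
}

for framework, (throughput, efficiency) in reported.items():
    ideal = implied_linear_throughput(throughput, efficiency)
    per_gpu = ideal / NUM_GPUS
    print(f"{framework}: ideal ~{ideal:,.0f} images/sec, "
          f"implied baseline ~{per_gpu:,.0f} images/sec per GPU")
```

Read this way, the gap between the ideal and the measured figure is the cost of keeping 64 GPUs in step, which is the overhead the infrastructure tuning aims to shrink.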

“The results of this work demonstrate that the AWS platform can be used for rapid training of deep learning networks, using a performant, flexible and scalable architecture,” the team stated.  

The results were published today on an AWS blog.
