Facebook published a paper today detailing how they trained a deep learning model on nearly 1.3 million images in under an hour using 256 Tesla P100 GPUs, a task that previously took days on a single system.
The team reduced the training time of a ResNet-50 deep learning model on ImageNet from 29 hours to one by distributing training in larger minibatches across more GPUs. Previously, minibatches of 256 images were spread across eight Tesla P100 GPUs; the new work achieves the same level of accuracy with minibatch sizes as large as 8,192 images distributed across 256 GPUs.
According to the paper, “to achieve this result, we adopt a linear scaling rule for adjusting learning rates as a function of minibatch size and develop a new warmup scheme that overcomes optimization challenges early in training.” They were able to achieve near-linear SGD scaling by using an optimized allreduce implementation. For the local reduction, they used the NVIDIA Collective Communications Library (NCCL), which implements multi-GPU collective communication primitives that are performance-optimized for NVIDIA GPUs.
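The two learning-rate techniques quoted above are simple to state in code. The sketch below is illustrative, not Facebook's implementation: the function names are our own, and it assumes the paper's reference setup of a base learning rate of 0.1 per 256 images with a gradual warmup over the first few epochs.

```python
def scaled_lr(base_lr, base_batch, batch):
    # Linear scaling rule: when the minibatch grows by a factor k,
    # multiply the learning rate by the same factor k.
    return base_lr * batch / base_batch

def warmup_lr(target_lr, start_lr, epoch, warmup_epochs=5):
    # Gradual warmup: ramp the learning rate linearly from start_lr
    # up to the scaled target_lr over the first warmup_epochs epochs,
    # avoiding instability early in large-minibatch training.
    if epoch < warmup_epochs:
        return start_lr + (target_lr - start_lr) * epoch / warmup_epochs
    return target_lr
```

For example, scaling the paper's reference rate of 0.1 (for 256 images) to a minibatch of 8,192 gives `scaled_lr(0.1, 256, 8192)`, i.e. a learning rate of 3.2, which `warmup_lr` approaches linearly from 0.1 during warmup.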
Facebook used the open source deep learning framework Caffe2 and their Big Basin GPU server, which houses eight NVIDIA Tesla P100 GPU accelerators interconnected with NVIDIA NVLink.
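The gradient aggregation described above is hierarchical: gradients are first reduced locally across the eight GPUs inside each Big Basin server (the role NCCL plays), then combined across servers via allreduce, and the result is shared back to every GPU. A minimal in-process sketch of that data flow, with illustrative names and plain Python sums standing in for the actual collectives:

```python
def hierarchical_allreduce(machines):
    """Sketch of two-level gradient aggregation.

    `machines[m][g]` is the gradient list held by GPU g on machine m.
    Phase 1: intra-machine reduction (done by NCCL on real hardware).
    Phase 2: inter-machine allreduce of the per-machine sums.
    Phase 3: broadcast the global sum back to every GPU.
    """
    # Phase 1: sum gradients elementwise across GPUs within each machine
    local_sums = [[sum(vals) for vals in zip(*gpus)] for gpus in machines]
    # Phase 2: sum the per-machine results elementwise across machines
    global_sum = [sum(vals) for vals in zip(*local_sums)]
    # Phase 3: every GPU on every machine receives the same global sum
    return [[list(global_sum) for _ in gpus] for gpus in machines]
```

After this step each worker holds identical summed gradients, so all 256 GPUs apply the same SGD update and stay in sync; a real implementation overlaps these phases with backpropagation rather than running them afterwards.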