Nvidia has set six new records in the MLPerf benchmark for how fast an AI model can be trained using a predetermined group of datasets.
MLPerf is a benchmark suite created by prominent companies in the field to standardize how AI training and inference speed are measured. It is often used to report the performance of commercially available cloud computing services, mobile devices, and server hardware.
Companies that contributed to the creation of MLPerf include Google, Nvidia, Baidu, and supercomputer maker Cray.
Nvidia set records for image classification with ResNet-50 version 1.5 on the ImageNet dataset, object instance segmentation, object detection, non-recurrent translation, recurrent translation, and recommendation systems.
“For all of these benchmarks we outperformed the competition by up to 4.7x faster,” Nvidia VP and general manager of accelerated computing Ian Buck said in a conference call with reporters. “There are certainly faster DGX-2 ResNet-50 renditions out there, but none under MLPerf benchmark guidelines.”
The feat was achieved with Nvidia DGX systems, which use NVSwitch interconnects to link up to 16 fully connected V100 Tensor Core GPUs, a chip first unveiled in spring 2017. Nvidia submitted results in the single-node category with 16 GPUs, as well as in distributed training at scales from 16 GPUs up to 80 nodes with 640 GPUs.
With a single node, Nvidia trained ResNet-50 in 70 minutes. With distributed training, it trained ResNet-50 in 6.3 minutes. By comparison, training ResNet-50 on a single CUDA GPU would have taken 25 days in 2015.
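To put those three figures on a common scale, the short Python sketch below converts each to minutes and computes the implied speedups. The numbers come from the figures quoted above; the variable and function names are illustrative only, and the comparison is plain arithmetic on wall-clock times, not part of any MLPerf methodology.

```python
# Wall-clock ResNet-50 training times quoted in the article, in minutes.
SINGLE_NODE_MIN = 70.0               # single-node result
DISTRIBUTED_MIN = 6.3                # distributed-training result
SINGLE_GPU_2015_MIN = 25 * 24 * 60   # 25 days on a single GPU in 2015


def speedup(baseline_minutes: float, new_minutes: float) -> float:
    """Return how many times faster the new run is than the baseline."""
    return baseline_minutes / new_minutes


if __name__ == "__main__":
    comparisons = [
        ("single node vs. 2015 single GPU", SINGLE_GPU_2015_MIN, SINGLE_NODE_MIN),
        ("distributed vs. 2015 single GPU", SINGLE_GPU_2015_MIN, DISTRIBUTED_MIN),
        ("distributed vs. single node", SINGLE_NODE_MIN, DISTRIBUTED_MIN),
    ]
    for label, baseline, new in comparisons:
        print(f"{label}: about {speedup(baseline, new):.1f}x faster")
```

By this arithmetic, the single-node run is roughly 514 times faster than the 2015 single-GPU figure, the distributed run is roughly 5,714 times faster, and the distributed run is about 11 times faster than the single node.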
The rapid rise in compute power has played a major role in the emergence of AI as an influential force in technology, business, and society in recent years.