Deep learning is fueling breakthroughs in everything from consumer mobile apps to image recognition. Yet developing and training deep learning-based AI models poses many challenges. One of the most difficult roadblocks is the time it takes to train the models.

The need to crunch lots of data and the computational complexity of building deep learning-based AI models also slow progress in accuracy and make deploying deep learning at scale less practical. Training times, often measured in days and sometimes weeks, are what hold up implementation.

To build highly accurate deep learning models faster, we need to reduce training time from days to hours, and ultimately to minutes or seconds.

GPUs too fast for their own good

In order to understand the problem deep learning researchers are trying to solve, consider the simple tale of the Blind Men and the Elephant. In the fable, each blind man feels a different part of the elephant — but only one part, such as the side or the tusk. Then they argue about what the entire elephant looks like based on their own limited experience.

If you gave the blind men some time, they could share enough information to piece together a pretty accurate picture of an elephant. It’s the same with graphics processing units (GPUs), which are used with CPUs to accelerate deep learning, analytics, and computing.

If you have slow compute chips in a system, you can keep them synced on their learning progress fairly easily.

But as GPUs become smarter and faster, they crunch through their learning very quickly, and without a better means of communicating they fall out of sync and spend too much time waiting for each other's results. As a result, adding more, faster-learning GPUs yields no speedup, and can even degrade performance.
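To see why learners must stay in sync, consider a minimal sketch of synchronous data-parallel training (a hypothetical illustration, not IBM's implementation): each learner computes a gradient on its own shard of data, then all learners average their gradients before anyone takes the next step. That averaging is a barrier, so one slow or out-of-sync learner stalls every other learner in the system.

```python
# Minimal sketch of synchronous data-parallel training (hypothetical illustration).
# Each "learner" holds one shard of data for a 1-D linear model y = w * x.

def local_gradient(shard, w):
    """Gradient of mean squared error on one learner's shard."""
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)

def synchronous_step(shards, w, lr=0.01):
    # Each learner computes its gradient independently (in parallel, in reality)...
    grads = [local_gradient(shard, w) for shard in shards]
    # ...then all learners synchronize at a barrier and average their gradients.
    avg_grad = sum(grads) / len(grads)
    # Every learner applies the identical update, keeping all replicas of w in lockstep.
    return w - lr * avg_grad

# Four learners, each holding one data point generated from the true model y = 3x.
shards = [[(x, 3.0 * x)] for x in (1.0, 2.0, 3.0, 4.0)]
w = 0.0
for _ in range(200):
    w = synchronous_step(shards, w)
print(round(w, 2))  # converges toward the true weight 3.0
```

The averaging step is where the communication cost lives: the faster each learner finishes its gradient, the more the total time is dominated by waiting at that barrier.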

The functional gap in deep learning systems

To train accurate models faster, data scientists and researchers need to distribute deep learning across a large number of servers. However, most popular deep learning frameworks scale across the GPUs, or learners, within a single server, but not across many servers with GPUs.

The challenge is that orchestrating and optimizing a deep learning problem across many servers is difficult: the faster the GPUs run, the faster they learn, and they need to share that learning with all of the other GPUs at a rate that isn't possible with conventional software.

This functional gap in deep learning systems recently led an IBM Research team to develop distributed deep learning (DDL) software and algorithms that automate and optimize the parallelization of large and complex computing tasks across hundreds of GPU accelerators attached to dozens of servers.

For this software, the researchers developed a custom communication library that helps all the learners (GPUs) in the system communicate with each other at very close to optimal speeds and bandwidths. And the library isn’t hard-coded into just one deep learning software package, so it can be integrated with frameworks such as TensorFlow, Caffe, and Torch.
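One well-known pattern for this kind of near-optimal collective communication is a ring allreduce, in which learners pass partial sums around a ring rather than funneling everything through one node. The sketch below simulates that pattern in plain Python; it is an illustration of the general technique, not the actual library the IBM team built. Each learner's gradient vector is split into N chunks (one scalar per chunk here, for simplicity).

```python
# Simplified simulation of a ring allreduce across N learners (illustrative
# sketch of a standard technique, not the communication library from the article).
# Requires each learner's vector to have exactly N elements, one chunk per learner.

def ring_allreduce(vectors):
    n = len(vectors)
    chunks = [list(v) for v in vectors]  # chunks[i][c]: learner i's copy of chunk c
    # Reduce-scatter: in n-1 steps, each chunk travels around the ring
    # accumulating contributions, so each learner ends up owning one summed chunk.
    for step in range(n - 1):
        sent = [chunks[i][(i - step) % n] for i in range(n)]
        for i in range(n):
            chunks[(i + 1) % n][(i - step) % n] += sent[i]
    # Allgather: in n-1 more steps, the finished chunks circulate the ring
    # until every learner holds the complete summed vector.
    for step in range(n - 1):
        sent = [chunks[i][(i + 1 - step) % n] for i in range(n)]
        for i in range(n):
            chunks[(i + 1) % n][(i + 1 - step) % n] = sent[i]
    return chunks

# Three learners with per-learner gradient vectors; after the allreduce,
# every learner holds the elementwise sum of all three vectors.
result = ring_allreduce([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print(result[0])  # [12, 15, 18]
```

The appeal of this pattern is that each learner sends only about 2(n-1)/n times its vector size in total, nearly independent of how many learners join the ring, which is why ring-style schemes come close to the bandwidth optimum as systems scale.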

The communication between the GPUs used in this project was critical to breaking the training record for image recognition. Researchers were able to reduce training time for a neural network called ResNet-50 to only 50 minutes. On another network, ResNet-101, they achieved a record image recognition accuracy of 33.8 percent using 7.5 million training images. These images came from ImageNet, a large dataset that contains over 15 million labeled high-resolution images belonging to around 22,000 different categories.

This approach allows data scientists and machine learning researchers to quickly train, and improve the accuracy of, neural network models: computer software modeled on the human brain and nervous system. Neural network models trained to high accuracy can complete specific tasks, such as detecting cancer cells in medical images. And their accuracy can be further improved by retraining, which takes seconds.

Moving deep learning out of the ivory tower

The goal, of course, is to make AI algorithms and software, as well as other machine learning technologies, operate as quickly as possible. Through systems design and innovation, DDL software like this could solve the deep learning productivity problem. The faster new AI capabilities can be created, the faster consumers experience better accuracy in things like picture labeling and speech recognition.

AI is already becoming faster, more intelligent, and higher functioning. But we need to move deep learning out of the ivory tower, where training times and accuracies still need further improvement. To do so, we have to shorten the time it takes to get innovations out of researchers' hands and into the hands of customers, who need business results in minutes or seconds. It's up to researchers to find new ways to make deep learning faster, with the right frameworks, to tackle persistent and challenging AI problems.

Hillery Hunter is an IBM fellow and director of the accelerated cognitive infrastructure group at IBM’s T.J. Watson Research Center in Yorktown Heights, New York.