New IBM technique cuts AI speech recognition training time from a week to 11 hours

Reliable, robust, and generalizable speech recognition is an ongoing challenge in machine learning. Traditionally, training natural language understanding models requires corpora containing thousands of hours of speech and millions (or even billions) of words of text, not to mention hardware powerful enough to process them within a reasonable timeframe.

To ease the computational burden, IBM in a newly published paper ("Distributed Deep Learning Strategies for Automatic Speech Recognition") proposes a distributed processing architecture that can achieve a 15-fold training speedup with no loss in accuracy on a popular open source benchmark (Switchboard). Deployed on a system containing multiple graphics cards, the paper's authors say, it can reduce the total amount of training time from weeks to days.

The work is scheduled to be presented at the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP) next month.

As contributing researchers Wei Zhang, Xiaodong Cui, and Brian Kingsbury explain in a forthcoming blog post, training an automatic speech recognition (ASR) system like those in Apple's Siri, Google Assistant, and Amazon's Alexa requires sophisticated encoding systems to convert voices to features understood by deep learning systems and decoding systems that convert the output to human-readable text. The models tend to be on the larger side, too, which makes training at scale more difficult.

The team's parallelized solution entails boosting batch size, or the number of samples that can be processed at once, but not indiscriminately -- that would negatively affect accuracy. Instead, they use a "principled approach" to increase the batch size to 2,560 while applying a distributed deep learning technique called asynchronous decentralized parallel stochastic gradient descent (ADPSGD).

As the researchers explain, most deep learning models employ either synchronous approaches to optimization, which are disproportionately affected by slow systems, or parameter-server (PS)-based asynchronous approaches, which tend to result in less accurate models. By contrast, ADPSGD -- which IBM first detailed in a paper last year -- is asynchronous and decentralized, guaranteeing a baseline level of model accuracy and delivering a speedup for certain types of optimization problems.
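To make the decentralized idea concrete, here is a minimal single-process sketch in Python. It is a toy simulation under stated assumptions, not IBM's implementation: the function name `adpsgd_toy`, the ring layout, the scalar quadratic objective, and every hyperparameter are hypothetical choices for illustration. Each simulated worker computes a noisy gradient on its own copy of the model, averages that copy with one neighbor rather than with a central parameter server, and then applies the gradient.

```python
import random

def adpsgd_toy(num_workers=8, steps=4000, lr=0.05, seed=0):
    """Toy simulation of decentralized, asynchronous-style SGD on a ring.

    Each worker keeps its own scalar model x and minimizes
    f(x) = 0.5 * (x - target)^2 from noisy gradients. All names and
    values here are illustrative assumptions, not the paper's setup.
    """
    rng = random.Random(seed)
    target = 3.0
    # Every worker starts from a different point.
    models = [rng.uniform(-10.0, 10.0) for _ in range(num_workers)]

    for _ in range(steps):
        # Asynchrony is mimicked by waking one random worker at a time,
        # so no worker ever waits for the slowest one.
        i = rng.randrange(num_workers)
        # Noisy gradient evaluated on the worker's current local model.
        grad = (models[i] - target) + rng.gauss(0.0, 0.1)
        # Decentralized step: average with one ring neighbor only.
        j = (i + rng.choice((-1, 1))) % num_workers
        avg = 0.5 * (models[i] + models[j])
        # Worker i applies its gradient to the averaged model.
        models[i] = avg - lr * grad
        models[j] = avg
    return models

if __name__ == "__main__":
    final = adpsgd_toy()
    print("spread across workers:", max(final) - min(final))
```

In this toy setting the workers drift toward a shared solution even though no step involves more than two of them, which is the property that lets a decentralized scheme sidestep both the global barrier of synchronous training and the single bottleneck of a parameter server.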

In tests, the paper's authors say, ADPSGD shortened the ASR job running time from one week on a single V100 GPU to 11.5 hours on a 32-GPU system. They leave to future work algorithms that can handle larger batch sizes and systems optimized for more powerful hardware.
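The reported figures can be checked with quick arithmetic. The short Python sketch below assumes "one week" means exactly 168 hours, which is an approximation of the article's wording rather than a number from the paper, so the results should be read as rough. The variable names are illustrative.

```python
# Back-of-the-envelope check of the reported speedup.
baseline_hours = 7 * 24      # assumption: "one week" taken as exactly 168 hours
distributed_hours = 11.5     # reported running time on the 32-GPU system
num_gpus = 32

speedup = baseline_hours / distributed_hours
efficiency = speedup / num_gpus  # fraction of ideal linear scaling

print(f"speedup: {speedup:.1f}x")                # speedup: 14.6x
print(f"scaling efficiency: {efficiency:.0%}")   # scaling efficiency: 46%
```

Under that assumption the speedup works out to roughly 14.6 times, consistent with the 15-fold figure cited earlier, and corresponds to a little under half of what perfectly linear scaling across 32 GPUs would give.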

"Turning around a training job in half a day is desirable, as it enables researchers to rapidly iterate to develop new algorithms," Zhang, Cui, and Kingsbury wrote. "This also allows developers fast turnaround time to adapt existing models to their applications, especially for custom use cases when massive amounts of speech are needed to achieve the high levels of accuracy needed for robustness and usability."
