Google researchers boost speech recognition accuracy with more datasets

What if the key to improving speech recognition accuracy is simply mixing all available speech datasets together to train one large AI model? That's the hypothesis behind a recent study published by a team of researchers affiliated with Google Research and Google Brain. They claim an AI model named SpeechStew that was trained on a range of speech corpora achieves state-of-the-art or near-state-of-the-art results on a variety of speech recognition benchmarks.

Training models on more data tends to be difficult, as collecting and annotating new data is expensive -- particularly in the speech domain. Moreover, training large models is expensive and impractical for many members of the AI community.

Dataset solution

In pursuit of a solution, the Google researchers combined all available labeled and unlabelled speech recognition data curated by the community over the years. They drew on AMI, a dataset containing about 100 hours of meeting recordings, as well as corpora that include Switchboard (approximately 2,000 hours of telephone calls), Broadcast News (50 hours of television news), Librispeech (960 hours of audiobooks), and Mozilla's crowdsourced Common Voice. Their combined dataset had over 5,000 hours of speech -- none of which was adjusted from its original form.

With the assembled dataset, the researchers used Google Cloud TPUs to train SpeechStew, yielding a model with more than 100 million parameters. In machine learning, parameters are the properties of the data that the model learned during the training process. The researchers also trained a 1-billion-parameter model, but it suffered from degraded performance.

Once the team had a general-purpose SpeechStew model, they tested it on a number of benchmarks and found that it not only outperformed previously developed models but demonstrated an ability to adapt to challenging new tasks. Leveraging Chime-6, a 40-hour dataset of distant conversations in homes recorded by microphones, the researchers fine-tuned SpeechStew to achieve accuracy in line with a much more sophisticated model.

Transfer learning entails transferring knowledge from one domain to a different domain with less data, and it has shown promise in many subfields of AI. By taking a model like SpeechStew that's designed to understand generic speech and refining it at the margins, it's possible for AI to, for example, understand speech in different accents and environments.

Future applications

When VentureBeat asked via email how speech models like SpeechStew might be used in production -- like in consumer devices or cloud APIs -- the researchers declined to speculate. But they envision the models serving as general-purpose representations that are transferrable to any number of downstream speech recognition tasks.

"This simple technique of fine-tuning a general-purpose model to new downstream speech recognition tasks is simple, practical, yet shockingly effective," the researchers said. "It is important to realize that the distribution of other sources of data does not perfectly match the dataset of interest. But as long as there is some common representation needed to solve both tasks, we can hope to achieve improved results by combining both datasets."