Google today explained how its researchers improved the speech recognition systems that power voicemail transcription in Google Voice. Last month Google disclosed that the recognition error rate in Google Voice had dropped by 50 percent, and now the company is describing how it achieved that result.
In short, Google rebuilt the transcription system. The old one relied on a common machine learning technique known as a Gaussian Mixture Model. The new version uses a type of artificial intelligence called deep learning, specifically Long Short-Term Memory Recurrent Neural Networks, Google research scientist Françoise Beaufays explained in a blog post today.
Artificial neural networks can be trained on large amounts of data, such as voicemail messages, and can then make inferences about new data they receive. In this case, Google trained on voicemails donated by its users. Beaufays explained:
We decided to retrain both the acoustic and language models, and to do so using existing voicemails. We already had a small set of voicemails users had donated for research purposes and that we could transcribe for training and testing, but we needed much more data to retrain the language models. So we asked our users to donate their voicemails in bulk, with the assurance that the messages wouldn’t be looked at or listened to by anyone — only to be used by computers running machine learning algorithms. But how does one train models from data that’s never been human-validated or hand-transcribed?
We couldn’t just use our old transcriptions, because they were already tainted with recognition errors — garbage in, garbage out. Instead, we developed a delicate iterative pipeline to retrain the models. Using improved acoustic models, we could recognize existing voicemails offline to get newer, better transcriptions the language models could be retrained on, and with better language models we could recognize again the same data, and repeat the process.
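The loop Beaufays describes (recognize the voicemails, retrain the language model on the new transcripts, recognize again) can be illustrated with a toy sketch. This is not Google's pipeline: the "acoustic model" output is faked as a hand-written list of candidate transcripts per voicemail, the "language model" is a plain word counter, and the names `recognize`, `train_language_model` and `iterative_retrain` are hypothetical. Only the language model is retrained here; how acoustic-model retraining fits into each pass is not spelled out in the post.

```python
from collections import Counter

def recognize(candidates, language_model):
    """Pick the candidate transcript the language model scores highest.

    `candidates` stands in for acoustic-model output (an assumption);
    ties go to the first candidate, which is how max() behaves.
    """
    return max(
        candidates,
        key=lambda text: sum(language_model[word] for word in text.split()),
    )

def train_language_model(transcripts):
    """Toy language model: word counts over the current transcripts."""
    return Counter(word for text in transcripts for word in text.split())

def iterative_retrain(voicemails, language_model, rounds=3):
    """Alternate between re-recognizing the data and retraining the model."""
    history = []
    for _ in range(rounds):
        transcripts = [recognize(c, language_model) for c in voicemails]
        language_model = train_language_model(transcripts)
        history.append(transcripts)
    return language_model, history

# Each voicemail is a list of candidate transcripts (made-up data).
voicemails = [
    ["call me back", "tall me back"],
    ["tall me later", "call me later"],
    ["please call me", "please tall me"],
]

model, history = iterative_retrain(voicemails, Counter(), rounds=2)
print(history[0][1])  # first pass, no language model yet: "tall me later"
print(history[1][1])  # second pass, retrained model: "call me later"
```

In this toy run the first pass mis-transcribes the second voicemail, and the model retrained on the first-pass output is already good enough to correct it on the second pass. A real system would work with audio, lattices and far larger models.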
Google has used deep learning to bolster many of its services, including Google Translate. And at the Google I/O conference, Sundar Pichai, who was named chief executive of Google yesterday under the new Alphabet umbrella company, announced that Google’s speech recognition error rate is now just 8 percent thanks to advances in deep learning.
Google Voice transcriptions are now more accurate as a result of deep learning, which poses fresh challenges for other companies working on speech recognition, including Apple with Siri and Microsoft with Cortana.
Check out Beaufays’ full blog post to learn more.