Google AI researchers are applying computer vision techniques to visual representations of sound waves to achieve state-of-the-art speech recognition performance without the use of a language model. The researchers say the SpecAugment method requires no additional data and can be used without adaptation of underlying language models.
“An unexpected outcome of our research was that models trained with SpecAugment out-performed all prior methods even without the aid of a language model,” Google AI resident Daniel S. Park and research scientist William Chan said in a blog post today. “While our networks still benefit from adding a language model, our results are encouraging in that it suggests the possibility of training networks that can be used for practical purposes without the aid of a language model.”
SpecAugment works in part by applying data augmentation techniques borrowed from image analysis to spectrograms, which are visual representations of speech. SpecAugment was applied to Listen, Attend, and Spell networks for speech recognition tasks to achieve a 2.6% word error rate (WER) with LibriSpeech 960h, a collection of about 1,000 hours of spoken English, and a 6.8% word error rate with Switchboard 300h, a collection of 260 hours of telephone conversations in English.
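To make the idea of augmenting a spectrogram concrete, here is a minimal sketch of the masking portion of this kind of augmentation: it blanks out one random band of frequency rows and one random span of time frames. The function name `mask_spectrogram`, the list-of-lists layout, the mask widths, and the choice to fill masked cells with zero are all assumptions made for illustration, and the time-warping step described in the paper is left out. It should not be read as the researchers' implementation.

```python
import random


def mask_spectrogram(spec, max_freq_width=2, max_time_width=3, rng=None):
    """Return a masked copy of `spec`.

    `spec` is a list of frequency rows, each a list of values over time.
    One random band of rows and one random span of columns are set to 0.0.
    All names and default widths here are illustrative assumptions.
    """
    rng = rng or random.Random()
    n_freq = len(spec)
    n_time = len(spec[0])
    out = [row[:] for row in spec]  # copy so the input is left untouched

    # Frequency mask: blank `f` consecutive frequency rows starting at `f0`.
    f = rng.randint(0, min(max_freq_width, n_freq))
    f0 = rng.randint(0, n_freq - f)
    for i in range(f0, f0 + f):
        out[i] = [0.0] * n_time

    # Time mask: blank `t` consecutive time frames starting at `t0`.
    t = rng.randint(0, min(max_time_width, n_time))
    t0 = rng.randint(0, n_time - t)
    for row in out:
        for j in range(t0, t0 + t):
            row[j] = 0.0

    return out


# Toy "spectrogram": 8 frequency bins by 10 time frames, all ones.
toy = [[1.0] * 10 for _ in range(8)]
augmented = mask_spectrogram(toy, rng=random.Random(0))
```

Because the masks are drawn at random each time, the same recording can yield many slightly different training examples, which is how the approach avoids needing additional data.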
Automatic speech recognition (ASR) systems translate speech into text for conversational AI such as Google Assistant in Home smart speakers, or Gboard’s dictation tool for email and text messages on Android smartphones. Reductions in word error rates can be a key factor in conversational AI adoption rates, according to a 2018 PricewaterhouseCoopers survey.
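For readers unfamiliar with the metric, word error rate is conventionally computed as the minimum number of word substitutions, insertions, and deletions needed to turn the system's transcript into the reference transcript, divided by the number of words in the reference. The sketch below shows that calculation under simple assumptions: the function name `word_error_rate` is hypothetical, and words are split on whitespace with no text normalization, which real evaluations typically add.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing
                curr[j - 1] + 1,     # insertion: extra hypothesis word
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr

    return prev[-1] / len(ref)


# One substituted word out of five reference words.
print(word_error_rate("turn on the kitchen lights", "turn on the kitchen light"))  # 0.2
```

By this measure, a 2.6% word error rate corresponds to roughly one wrong word in every 38 words of reference text.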
Advances in language models and compute power have driven reductions in word error rates that, in recent years, have made typing with your voice faster than typing with your thumbs, for example.
The achievement was detailed in “SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition,” a paper published on arXiv on April 18.
Continuous improvement is part of the pitch that makers of assistants like Alexa frequently make, and Google and Amazon have shared a number of papers in recent months detailing methods used to accelerate that change.
Isolation of background noise may improve Alexa’s speech recognition rates by up to 15%, Amazon announced today, while semi-supervised training methods that will be applied to Alexa speech recognition later this year are expected to deliver improvements of more than 20%.