Speech recognition is pretty darn good these days. State-of-the-art models like EdgeSpeechNet, which was detailed in a research paper late last year, are capable of achieving about 97 percent accuracy. But even the best systems sometimes stumble on rare words.
To narrow the gap, scientists at Google and the University of California propose an approach that taps a spelling correction model trained on text-only data. In a paper published on the preprint server Arxiv.org (“A Spelling Correction Model for End-to-End Speech Recognition”), they report that in experiments with the LibriSpeech dataset, which comprises 960 hours of speech and an 800 million-word language modeling corpus, their technique shows an 18.6 percent relative improvement in word error rate (WER) over the baseline. In some cases, it even managed a 29 percent error reduction.
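For readers who want the arithmetic behind those figures, here is a minimal sketch of how WER and a relative improvement are typically computed. WER is the word-level edit distance (substitutions, deletions, and insertions) divided by the number of words in the reference transcript. The function names and the sample sentences are illustrative assumptions, not taken from the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the first i-1 reference
    # words and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def relative_improvement(baseline_wer: float, new_wer: float) -> float:
    """Fraction of the baseline error that the new system removes."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return (baseline_wer - new_wer) / baseline_wer
```

Under this definition, a system that cuts a made-up baseline WER of 10 percent to 8 percent shows a 20 percent relative improvement, even though the absolute drop is only two percentage points.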
“The goal … is to incorporate a module trained on [text] data into the end-to-end framework, with the objective of correcting errors made by the system,” they wrote. “Specifically, we investigate using unpaired … data to [generate] audio signals using a text-to-speech (TTS) system, a process similar to backtranslation in machine translation.”
As the paper’s authors explain, most automatic speech recognition (ASR) systems jointly train three components: an acoustic model that learns the relationship between audio signals and the linguistic units that make up speech, a language model that assigns probabilities to sequences of words, and a mechanism that aligns the acoustic frames with the recognized symbols. All three share a single neural network (layered mathematical functions modeled after biological neurons) and are trained on transcribed audio-text pairs. As a result, the language model typically suffers degraded performance when it encounters words that occur infrequently in the corpus.
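To see why rare words hurt, consider a toy stand-in for the language model. The sketch below is a count-based bigram model with add-one smoothing, which is far simpler than the neural models the paper deals with and is used here only as an assumption-laden illustration. The class name and the tiny corpus are invented for the example.

```python
import math
from collections import Counter


class BigramLM:
    """Toy bigram language model with add-one (Laplace) smoothing."""

    def __init__(self, corpus):
        self.context_counts = Counter()
        self.bigram_counts = Counter()
        self.vocab = set()
        for sentence in corpus:
            tokens = ["<s>"] + sentence.split() + ["</s>"]
            self.vocab.update(tokens)
            # Every token except the last serves as a context once.
            self.context_counts.update(tokens[:-1])
            self.bigram_counts.update(zip(tokens, tokens[1:]))

    def log_prob(self, sentence: str) -> float:
        """Log-probability the model assigns to a whole sentence."""
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        vocab_size = len(self.vocab) + 1  # one extra slot for unseen words
        total = 0.0
        for prev, word in zip(tokens, tokens[1:]):
            numerator = self.bigram_counts[(prev, word)] + 1
            denominator = self.context_counts[prev] + vocab_size
            total += math.log(numerator / denominator)
        return total
```

Trained on three short sentences about cats and dogs, a model like this scores "the cat sat" well above "the yak sat", simply because it never saw "yak". A recognizer leaning on such scores will tend to prefer the familiar word, even when the speaker said the rare one.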
The researchers, then, set out to incorporate the aforementioned spelling correction model into the ASR framework. The model represents input and output sentences as sub-word units called “wordpieces,” and it takes the word embeddings (i.e., features mapped to vectors of real numbers) and maps them to higher-level representations. They used text-only data and corresponding synthetic audio signals generated using a text-to-speech (TTS) system (parallel WaveNet) to train an LAS (Listen, Attend and Spell) speech recognizer, an end-to-end model first described by Google Brain researchers in 2015, and subsequently to create a set of TTS pairs. Then, they “taught” the spelling corrector to correct potential errors made by the recognizer by feeding it those pairs.
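The data flow described here, from text to synthetic audio to recognizer output to training pairs, can be sketched roughly as follows. This is one plausible reading of the pipeline rather than the authors' code: `synthesize` and `recognize_nbest` are hypothetical callables standing in for the TTS system and the recognizer, and the toy versions below exist only so the sketch runs.

```python
from typing import Callable, Iterable, List, Tuple


def build_correction_pairs(
    sentences: Iterable[str],
    synthesize: Callable[[str], bytes],
    recognize_nbest: Callable[[bytes], List[str]],
) -> List[Tuple[str, str]]:
    """Return (recognizer hypothesis, ground-truth text) training pairs."""
    pairs = []
    for text in sentences:
        audio = synthesize(text)                 # text -> synthetic audio
        for hypothesis in recognize_nbest(audio):  # audio -> candidate transcripts
            # The corrector learns to map each hypothesis back to the truth.
            pairs.append((hypothesis, text))
    return pairs


# Toy stand-ins (assumptions for illustration only, not real TTS or ASR).
def toy_tts(text: str) -> bytes:
    return text.encode("utf-8")


def toy_recognizer(audio: bytes) -> List[str]:
    text = audio.decode("utf-8")
    # Pretend the recognizer sometimes confuses two homophones.
    return [text, text.replace("their", "there")]
```

Calling `build_correction_pairs(["their dog barked"], toy_tts, toy_recognizer)` yields one pair whose hypothesis is already correct and one whose hypothesis reads "there dog barked", both pointing at the same ground truth. Because the pairs come from text alone, the corrector can see far more vocabulary than the transcribed audio covers.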
To validate the model, the researchers trained a language model, generated a TTS dataset to train the LAS model, and produced error hypotheses to train the spelling correction model. The text-only data comprised 40 million text sequences from the LibriSpeech dataset, left after filtering out 500,000 sequences that contained only single-letter words or that were longer than 90 words. They found that, by correcting entries from the LAS, the spelling correction model could generate an expanded output with “significantly” lower word error rate.