Alexa speech normalization AI reduces errors by up to 81%

Text normalization is a fundamental processing step in most natural language systems. In the case of Amazon's Alexa, "Book me a table at 5:00 p.m." might be transcribed by the assistant's automatic speech recognizer as "five p m" and further reformatted to "5:00PM." Going the other way, Alexa might convert "5:00PM" back to "five p m" for its text-to-speech synthesizer.

So how does this work? Currently, Amazon's voice assistant relies on "thousands" of handwritten normalization rules for dates, email addresses, numbers, abbreviations, and other expressions, according to Alexa AI group applied scientist Ming Sun and Alexa Speech machine learning scientist Yuzong Liu. That's all well and good for English, but because the approach isn't particularly adaptable to other languages (without lots of manual labor), Amazon scientists are investigating a more scalable technique driven by machine learning.

In a preprint paper ("Neural Text Normalization with Subword Units") scheduled to be presented at the conference of the North American Chapter of the Association for Computational Linguistics (NAACL), Sun, Liu, and colleagues describe an AI text normalization system that breaks words in input and output streams into smaller strings of characters called subword units. These subword units, Sun and Liu explain in a blog post, reduce the number of inputs the machine learning model must learn and clear up ambiguity in snippets like "Dr." (which could mean "doctor" or "Drive") and "2/3" (which could mean "two-thirds" or "February third").

Furthermore, the subword units enable the AI model to better handle input words it hasn't seen before. Unfamiliar words might contain familiar subword components, and these are sometimes enough to help the model decide on a course of action.
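To make that concrete, here is a minimal Python sketch of how an unfamiliar word can be split into familiar pieces. The greedy longest-match strategy, the `segment` function, and the toy vocabulary are illustrative assumptions, not the method described in the paper.

```python
def segment(word, units):
    """Split `word` into known subword units using greedy longest match.

    `units` is a hypothetical set of subword strings. Characters that match
    no unit are emitted on their own so the function always terminates.
    """
    pieces = []
    i = 0
    while i < len(word):
        # Try the longest candidate starting at position i first.
        for j in range(len(word), i, -1):
            if word[i:j] in units:
                pieces.append(word[i:j])
                i = j
                break
        else:
            # No known unit starts here: fall back to the single character.
            pieces.append(word[i])
            i += 1
    return pieces


# Toy vocabulary (assumed): the model has seen "new" and "est", never "newest".
toy_units = {"new", "low", "er", "est"}
print(segment("newest", toy_units))
```

Even though "newest" is not in the toy vocabulary, it breaks down into "new" and "est", both of which a model could have encountered during training.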

The researchers' system created subword units by reducing words in a training data set to individual characters, which an algorithm ingested to identify the most commonly occurring two-character units and three-character units until it reached capacity (around 2,000 subwords). The components were used to train an AI system to output subword units, which a separate algorithm stitched together into complete words.
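The procedure resembles byte-pair-encoding-style merging, which can be sketched in a few lines of Python. This is a rough illustration under that assumption, not the researchers' code; the function names, the stopping rule, and the toy word list are hypothetical.

```python
from collections import Counter


def learn_subword_units(words, max_units):
    """Grow a subword vocabulary by repeatedly merging the most frequent
    adjacent pair of units, stopping once `max_units` is reached."""
    # Start with every word reduced to a tuple of individual characters.
    corpus = Counter(tuple(w) for w in words)
    vocab = {ch for symbols in corpus for ch in symbols}
    merges = []
    while len(vocab) < max_units:
        # Count how often each adjacent pair of units occurs.
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pair_counts[(a, b)] += freq
        if not pair_counts:
            break  # every word is already a single unit
        (a, b), _ = pair_counts.most_common(1)[0]
        merges.append((a, b))
        vocab.add(a + b)
        # Rewrite the corpus with the chosen pair fused into one unit.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == (a, b):
                    out.append(a + b)
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_corpus[tuple(out)] += freq
        corpus = new_corpus
    return vocab, merges


def stitch(units):
    """Join predicted subword units back into a complete word."""
    return "".join(units)


vocab, merges = learn_subword_units(["lower", "lowest", "low", "low", "slow"], max_units=9)
print(sorted(vocab, key=len), merges)
```

In the system described, the capacity would be around 2,000 subwords rather than the 9 used for this toy word list.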

After training the system on 500,000 examples from a public data set, the researchers say it achieved a 75% reduction in error rate compared with the best-performing machine learning system previously reported and a 63% reduction in latency, or the time it takes to receive a response to a single request. By factoring in additional information, such as parts of speech, position within the sentence, and capitalization, it pushed the error rate reduction to 81% and achieved a word error rate of just 0.2%.
