Text normalization is a fundamental processing step in most natural language systems. In the case of Amazon’s Alexa, a spoken request like “Book me a table at 5:00 p.m.” might be transcribed by the assistant’s automatic speech recognizer as “five p m” and then reformatted by text normalization to “5:00PM.” Going the other direction, Alexa converts the written “5:00PM” back to “five p m” for its text-to-speech synthesizer.

So how does this work? Currently, Amazon’s voice assistant relies on “thousands” of handwritten normalization rules for dates, email addresses, numbers, abbreviations, and other expressions, according to Alexa AI group applied scientist Ming Sun and Alexa Speech machine learning scientist Yuzong Liu. That’s all well and good for English, but because the approach doesn’t transfer to other languages without lots of manual labor, Amazon scientists are investigating a more scalable technique driven by machine learning.
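Amazon hasn’t published its rule set, but a minimal sketch of what rule-based normalization looks like, using a single hypothetical regular-expression rule for spoken times, might resemble the following:

```python
import re

# A minimal sketch of rule-based normalization with one hypothetical rule;
# Amazon's actual rules are not public.
WORD_TO_DIGIT = {
    "one": 1, "two": 2, "three": 3, "four": 4, "five": 5, "six": 6,
    "seven": 7, "eight": 8, "nine": 9, "ten": 10, "eleven": 11, "twelve": 12,
}

# Spoken time like "five p m" -> written "5:00 p.m."
TIME_RULE = re.compile(
    r"\b(one|two|three|four|five|six|seven|eight|nine|ten|eleven|twelve) ([ap]) m\b"
)

def normalize(text: str) -> str:
    return TIME_RULE.sub(
        lambda m: f"{WORD_TO_DIGIT[m.group(1)]}:00 {m.group(2)}.m.", text
    )

print(normalize("book me a table at five p m"))  # -> book me a table at 5:00 p.m.
```

Every new expression type, and every new language, needs more rules of this kind, which is exactly the maintenance burden the researchers want to escape.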

In a preprint paper, “Neural Text Normalization with Subword Units,” scheduled to be presented at the conference of the North American Chapter of the Association for Computational Linguistics (NAACL), Sun, Liu, and colleagues describe an AI text normalization system that breaks words in input and output streams into smaller strings of characters called subword units. These subword units, Sun and Liu explain in a blog post, reduce the number of inputs the machine learning model must learn and clear up ambiguity in snippets like “Dr.” (which could mean “doctor” or “drive”) and “2/3” (which could mean “two-thirds” or “February third”).

Furthermore, the subword units enable the AI model to better handle input words it hasn’t seen before. Unfamiliar words might contain familiar subword components, and these are sometimes enough to help the model decide on a course of action.
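To make that concrete, here is a minimal sketch of greedy subword segmentation, using a tiny hypothetical vocabulary; the model described in the paper learns its roughly 2,000 units from data rather than from a hand-picked list:

```python
# A minimal sketch of greedy subword segmentation with a hypothetical vocabulary.
VOCAB = {"boo", "king", "ing", "mee", "ting", "m", "e", "t"}

def segment(word: str) -> list[str]:
    """Greedily split a word into the longest subword units found in VOCAB."""
    units, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):  # try the longest match first
            if word[i:j] in VOCAB:
                units.append(word[i:j])
                i = j
                break
        else:  # no unit matched: fall back to the raw character
            units.append(word[i])
            i += 1
    return units

# An unseen word can still be covered by familiar pieces.
print(segment("booming"))  # -> ['boo', 'm', 'ing']
```

Even if the whole word never appeared in training, the familiar pieces give the model something to work with.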

The researchers’ system created its subword units by reducing words in a training data set to individual characters, then repeatedly identifying the most commonly occurring two- and three-character units until it reached a capacity of around 2,000 subwords. These units were used to train an AI system to output subword units, which a separate algorithm stitched back together into complete words.
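The blog post doesn’t spell out the exact procedure, but the general recipe resembles byte-pair encoding: start from characters and repeatedly merge the most frequent adjacent pair. A minimal sketch, assuming a BPE-style merge loop and a tiny toy corpus, might look like this:

```python
from collections import Counter

# A minimal BPE-style sketch; the paper's exact procedure and ~2,000-unit
# budget may differ.
def learn_subwords(corpus: list[str], num_merges: int) -> list[tuple[str, str]]:
    words = [list(w) for w in corpus]  # start with words split into characters
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for w in words:
            for a, b in zip(w, w[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        best = pairs.most_common(1)[0][0]  # most frequent adjacent pair
        merges.append(best)
        merged = []
        for w in words:  # merge that pair everywhere it occurs
            out, i = [], 0
            while i < len(w):
                if i + 1 < len(w) and (w[i], w[i + 1]) == best:
                    out.append(w[i] + w[i + 1])
                    i += 2
                else:
                    out.append(w[i])
                    i += 1
            merged.append(out)
        words = merged
    return merges

print(learn_subwords(["third", "thirty", "thirteen"], num_merges=4))
```

In practice, a loop like this would run until the vocabulary hits the roughly 2,000-unit budget described above.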

After training the system on 500,000 examples from a public data set, the researchers say it achieved a 75% reduction in error rate compared with the best-performing machine learning system previously reported, along with a 63% reduction in latency, or the time it takes to receive a response to a single request. Factoring in additional information, such as parts of speech, position within the sentence, and capitalization, pushed the error rate reduction to 81% and yielded a word error rate of just 0.2%.
