Automatic speech recognition systems like those at the core of Alexa convert speech into text, and one of their components is a language model that predicts which word will come after a sequence of words. These language models are typically n-gram based, meaning they estimate the probability of the next word given the past n-1 words. But architectures like recurrent neural networks, which are commonly used in speech recognition because of their ability to learn long-range dependencies, are tough to incorporate into real-time systems and often struggle to ingest data from multiple corpora.
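
To make the n-gram idea concrete, here is a minimal sketch of a bigram model (n = 2) that estimates the probability of a word from the single word before it. The toy corpus and the function names are invented for illustration, and production systems add smoothing for unseen word pairs, which this sketch omits.

```python
from collections import Counter

def train_bigram(sentences):
    """Count bigrams and their left contexts from tokenized sentences."""
    bigrams, contexts = Counter(), Counter()
    for tokens in sentences:
        padded = ["<s>"] + tokens + ["</s>"]
        for prev, word in zip(padded, padded[1:]):
            bigrams[(prev, word)] += 1
            contexts[prev] += 1
    return bigrams, contexts

def bigram_prob(bigrams, contexts, prev, word):
    """Maximum-likelihood estimate of P(word | prev); 0.0 for unseen contexts."""
    if contexts[prev] == 0:
        return 0.0
    return bigrams[(prev, word)] / contexts[prev]

# Hypothetical toy corpus, not data from the paper.
toy_corpus = [
    ["play", "some", "music"],
    ["play", "some", "jazz"],
    ["play", "the", "news"],
]
BIGRAMS, CONTEXTS = train_bigram(toy_corpus)
```

Calling `bigram_prob(BIGRAMS, CONTEXTS, "play", "some")` divides the count of "play some" by the count of "play" as a context.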

That’s why researchers at Amazon’s Alexa research division investigated techniques to make such AI models more practical for speech recognition. In a blog post and accompanying paper (“Scalable Multi Corpora Neural Language Models for ASR”) scheduled to be presented at the upcoming Interspeech 2019 conference in Graz, Austria, they claim they can reduce word recognition error rate by 6.2% over the baseline.

The researchers tackled the problem of data scarcity by building conventional models for both in-domain and out-of-domain training data sets, which they combined linearly. They assigned each corpus a score measuring its relevance to the in-domain data, and that score determined the likelihood that a sample from the corpus would be selected for a supplementary data set. Then they applied transfer learning, a method where a model developed for one task is reused as the starting point for a model on a second task, to train the AI model.
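
The two ideas in this step, linear combination and relevance-weighted sampling, can be sketched roughly as follows. The article does not say how the relevance scores or combination weights are computed, so the numbers a caller passes in, along with the function names `interpolate` and `sample_training_set`, are assumptions made purely for illustration.

```python
import random

def interpolate(prob_by_model, weights):
    """Linearly combine several models' probabilities for the same word."""
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("interpolation weights must sum to 1")
    return sum(w * p for w, p in zip(weights, prob_by_model))

def sample_training_set(corpora, relevance, size, seed=0):
    """Draw `size` sentences, picking each source corpus in proportion
    to its (assumed, externally supplied) relevance score."""
    rng = random.Random(seed)
    names = list(corpora)
    scores = [relevance[name] for name in names]
    picked = []
    for _ in range(size):
        name = rng.choices(names, weights=scores, k=1)[0]
        picked.append((name, rng.choice(corpora[name])))
    return picked
```

For example, `interpolate([0.2, 0.05], [0.7, 0.3])` blends an in-domain estimate of 0.2 with an out-of-domain estimate of 0.05, and a corpus with a higher relevance score contributes more sentences to the sampled set.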

The researchers next passed data through a speech recognizer with an n-gram language model, then used the AI model to refine its predictions in a second pass. To minimize the risk that the conventional model would reject hypotheses that the AI model would consider, they used the latter to generate synthetic data, which provided training data for the first-pass model.
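
A rough sketch of what second-pass refinement can look like is below. It assumes the first pass emits a short list of candidate transcriptions with log-probability scores and that the two scores are combined with a fixed weight; the article does not specify either detail, and `neural_lm_logprob` is a hypothetical stand-in for the AI model.

```python
def rescore(nbest, neural_lm_logprob, lm_weight=0.5):
    """Re-rank first-pass hypotheses using a second language model.

    nbest: list of (text, first_pass_score) pairs, scores as log-probabilities.
    neural_lm_logprob: callable mapping text -> log-probability (hypothetical).
    lm_weight: assumed weight on the second-pass score.
    """
    rescored = [
        (text, score + lm_weight * neural_lm_logprob(text))
        for text, score in nbest
    ]
    # Highest combined score first.
    return sorted(rescored, key=lambda pair: pair[1], reverse=True)
```

The sketch also shows why the first pass matters: a hypothesis the n-gram model never proposes cannot be recovered by rescoring, which is the risk the synthetic training data is meant to reduce.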

Samples within the training data come in pairs of words rather than individual words as part of a scheme called noise contrastive estimation, where one of the paired words is the true target while the other word is randomly selected. The model is tasked with learning to tell the difference by directly estimating the probability of the target word.
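
The pairing of a true word with a random one can be illustrated with a simplified noise contrastive estimation loss for a single pair. This is a textbook-style sketch under stated assumptions, not the paper's training code: the scores are the model's unnormalized log-probabilities, the noise probabilities come from whatever distribution the random word was drawn from, and exactly one noise word is used per target.

```python
import math

def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def nce_pair_loss(target_score, noise_score, target_noise_prob, noise_noise_prob):
    """NCE loss for one (true word, random word) pair.

    target_score / noise_score: model log-scores for the two words.
    target_noise_prob / noise_noise_prob: probability of each word under
    the noise distribution (assumed known, e.g. unigram frequencies).
    """
    delta_true = target_score - math.log(target_noise_prob)
    delta_noise = noise_score - math.log(noise_noise_prob)
    # Reward calling the true word "real" and the random word "noise".
    return -(log_sigmoid(delta_true) + log_sigmoid(-delta_noise))
```

The loss shrinks as the model scores the true word higher and the random word lower, so the model learns to tell them apart without computing a normalized distribution over the whole vocabulary.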

Lastly, the researchers quantized the weights of the AI model to increase its efficiency further. (“Weights” in this context refer to the strength of the connections between the nodes within the system, which receive data from other nodes and transform it before passing it along to others.) Quantization considers the full range of values a particular variable can take on and splits it into a fixed number of intervals, such that all values within an interval are approximated by a single number. According to the team, thanks to quantization, the AI model increased speech processing time by no more than 65 milliseconds in 50% of cases and no more than 285 milliseconds in 90% of cases.
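
As an illustration of the idea, here is a minimal sketch of uniform quantization. It assumes evenly spaced levels across the observed range of the weights and a default of 256 levels (8 bits); the article does not say which scheme or bit width the team used.

```python
def quantize(weights, num_levels=256):
    """Map each weight to the nearest of `num_levels` evenly spaced values.

    Returns integer codes plus the offset and step needed to decode them.
    """
    lo, hi = min(weights), max(weights)
    if hi == lo:
        # All weights identical: a single level represents them exactly.
        return [0] * len(weights), lo, 0.0
    step = (hi - lo) / (num_levels - 1)
    codes = [round((w - lo) / step) for w in weights]
    return codes, lo, step

def dequantize(codes, lo, step):
    """Recover the approximate weights from their integer codes."""
    return [lo + code * step for code in codes]
```

Each decoded weight differs from the original by at most half a step, and storing small integer codes instead of full-precision numbers is what makes the model cheaper to run.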