It’s a well-established fact that two mics are better than one when it comes to speech recognition. Intuitively, it makes sense: Sound waves reach multiple microphones with different time delays, which can be used to boost the strength of a signal coming from a certain direction while diminishing signals from other directions. Historically, however, the problem of speech enhancement — separating speech from noise — has been tackled independently from speech recognition, an approach the literature suggests yields substandard results.
But researchers at Amazon’s Alexa division believe they’ve developed a novel acoustic modeling framework that boosts performance by unifying speech enhancement and speech recognition. In experiments — when applied to a two-microphone system — they claim that their model reduces speech recognition error rates by 9.5% relative to a seven-mic system using older methods.
They describe their work in a pair of papers (“Frequency Domain Multi-Channel Acoustic Modeling for Distant Speech Recognition,” “Multi-Geometry Spatial Acoustic Modeling for Distant Speech Recognition”) scheduled to be presented at the International Conference on Acoustics, Speech, and Signal Processing in Brighton next month.
The first paper describes a multi-microphone method in which a single neural network replaces the separate, hand-coded algorithms that determine the directions of beamformers (spatial filters that operate on the output of sensors to enhance the amplitude of a wave) and identify speech signals. Amazon’s current Echo speaker lineup tweaks beamformers on the fly to adapt to new acoustic environments. But by training the single model on a large corpus from various environments, the researchers were able to do away with the adaptation step.
“Classical … technology is intended to steer a single [sound beam] in an arbitrary direction, but that’s a computationally intensive approach,” Kenichi Kumatani, a speech scientist in the Alexa Speech group, explained in a blog post. “With the Echo smart speaker, we instead point multiple beamformers in different directions and identify the one that yields the clearest speech signal … That’s why Alexa can understand your request for a weather forecast even when the TV is blaring a few yards away.”
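To make the idea of pointing several beamformers in different directions and keeping the best one more concrete, here is a minimal Python sketch. It is not Amazon’s implementation. It assumes a toy two-microphone setup with whole-sample delays, uses a simple delay-and-sum beamformer, and treats output energy as a stand-in for “clearest speech signal.” The function names (`delay_and_sum`, `pick_best_beam`) and the test tone are hypothetical and chosen only for illustration.

```python
import math

def delay_and_sum(mic_a, mic_b, delay):
    """Average mic_a with mic_b advanced by `delay` samples (zero-padded at the edges)."""
    n = len(mic_a)
    out = []
    for t in range(n):
        u = t + delay
        b = mic_b[u] if 0 <= u < n else 0.0
        out.append(0.5 * (mic_a[t] + b))
    return out

def energy(signal):
    """Sum of squared samples."""
    return sum(x * x for x in signal)

def pick_best_beam(mic_a, mic_b, candidate_delays):
    """Return the steering delay whose beam has the highest output energy.

    Assumption: energy is used here as a crude proxy for speech clarity.
    """
    return max(candidate_delays,
               key=lambda d: energy(delay_and_sum(mic_a, mic_b, d)))

# Toy scene: a 200 Hz tone sampled at 8 kHz reaches mic B three samples after mic A.
rate, true_delay, n = 8000, 3, 400
source = [math.sin(2 * math.pi * 200 * t / rate) for t in range(n)]
mic_a = source
mic_b = [0.0] * true_delay + source[:n - true_delay]

# Each candidate delay corresponds to one fixed "look direction."
best = pick_best_beam(mic_a, mic_b, range(-5, 6))
print("best steering delay:", best)
```

Each candidate delay plays the role of one fixed beamformer. The beam whose delay matches the true arrival difference adds the two channels in phase, so its output is strongest, while mismatched beams partially cancel the signal.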
Both the single neural network and the traditional model pass the output of the beamformers to a feature extractor in the form of log filter-bank energies, or snapshots of signal energies in multiple, irregular frequency bands. In the case of the traditional model, those energies are normalized against an estimate of the background noise, and the extractor’s output is passed to an AI system that computes the probabilities of features corresponding to different “phones,” or short units of phonetic information.
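For readers unfamiliar with log filter-bank energies, the following Python sketch shows one simplified way to compute them for a single audio frame. It is an illustration under stated assumptions, not the feature extractor from the papers: it uses a naive DFT, rectangular bands spaced evenly on the mel scale (real systems typically use overlapping triangular filters and a window function), and it skips the noise normalization step. All function names and parameter values are hypothetical.

```python
import cmath
import math

def power_spectrum(frame):
    """Naive DFT power for bins 0..N/2 (adequate for a short illustrative frame)."""
    n = len(frame)
    spectrum = []
    for k in range(n // 2 + 1):
        coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(frame))
        spectrum.append(abs(coeff) ** 2)
    return spectrum

def hz_to_mel(hz):
    return 2595.0 * math.log10(1.0 + hz / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def band_edges(num_bands, rate, n_fft):
    """DFT bin indices of band boundaries, evenly spaced on the mel scale."""
    top = hz_to_mel(rate / 2)
    hz = [mel_to_hz(top * i / num_bands) for i in range(num_bands + 1)]
    return [round(f * n_fft / rate) for f in hz]

def log_filterbank_energies(frame, rate, num_bands=8, floor=1e-10):
    """Log of the summed power in each band; bands widen as frequency rises."""
    power = power_spectrum(frame)
    edges = band_edges(num_bands, rate, len(frame))
    feats = []
    for lo, hi in zip(edges, edges[1:]):
        hi = max(hi, lo + 1)  # guard against an empty band
        feats.append(math.log(sum(power[lo:hi]) + floor))
    return feats

# One 256-sample frame of a 1 kHz tone sampled at 8 kHz.
rate, n = 8000, 256
frame = [math.sin(2 * math.pi * 1000 * t / rate) for t in range(n)]
feats = log_filterbank_energies(frame, rate)
print([round(f, 2) for f in feats])
```

The bands are “irregular” in the sense that they are narrow at low frequencies and wide at high frequencies. For the test tone, the band that contains 1 kHz carries by far the largest energy.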
According to the papers’ authors, performance improves if each component of the model (e.g., the feature extractor and the beamformer optimizer) is initialized separately. They add that diverse training data enables the model to handle a range of microphone configurations across device types.
“Among other advantages, this means that the ASR systems of new devices, or less widely used devices, can benefit from interaction data generated by devices with broader adoption,” Kumatani said.