Last week during an event in Seattle, Amazon unveiled a host of features heading to new and existing smart speakers powered by its Alexa voice platform. One of them was “whisper mode,” which enables Alexa to respond to whispered speech by whispering back. In a blog post published today, Zeynab Raeesy, a speech scientist in Amazon’s Alexa Speech group, revealed the feature’s artificial intelligence (AI) underpinnings.
Much of the work is detailed in a paper (“LSTM-based Whisper Detection”) that will be presented at the IEEE Workshop on Spoken Language Technology in December.
“If you’re in a room where a child has just fallen asleep, and someone else walks in, you might start speaking in a whisper, to indicate that you’re trying to keep the room quiet. The other person will probably start whispering, too,” Raeesy wrote. “We would like Alexa to react to conversational cues in just such a natural, intuitive way.”
What makes whispered speech difficult to interpret, Raeesy explained, is the fact that it’s predominantly unvoiced — that is to say, it doesn’t involve the vibration of the vocal cords. It also tends to have less energy in lower frequency bands than ordinary speech.
She and colleagues investigated the use of two different neural networks — layers of mathematical functions loosely modeled after the human brain’s neurons — to distinguish between normal and whispered words.
The two neural networks differed architecturally: one was a multilayer perceptron (MLP) and the other was a long short-term memory (LSTM) network, which processes inputs in sequential order. Both, however, were trained on the same data. That data consisted of (1) log filter-bank energies, or representations of speech signals that record the signal energies in different frequency ranges, and (2) a set of features that “exploit[ed] the signal differences between whispered and normal speech.”
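To make the idea of log filter-bank energies concrete, here is a minimal sketch in Python of how one frame of audio might be turned into such features. It is not the Alexa team's front end. The function names, the 16 kHz sampling rate, the 256-sample frame, and the eight equal-width bands are all assumptions chosen for illustration, and production systems typically use mel-spaced triangular filters rather than the rectangular bands used here.

```python
import cmath
import math


def power_spectrum(frame):
    """Naive DFT power spectrum of one frame (bins 0 .. N/2)."""
    n = len(frame)
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        spectrum.append(abs(acc) ** 2)
    return spectrum


def log_filterbank_energies(frame, num_bands=8, floor=1e-10):
    """Log of the energy in num_bands equal-width frequency bands.

    Simplification (assumption): equal-width rectangular bands stand in
    for the mel-spaced triangular filters a real front end would use.
    """
    spectrum = power_spectrum(frame)
    width = len(spectrum) / num_bands
    energies = []
    for b in range(num_bands):
        lo = int(round(b * width))
        hi = int(round((b + 1) * width))
        # The small floor avoids taking log(0) for empty bands.
        energies.append(math.log(sum(spectrum[lo:hi]) + floor))
    return energies


# Toy frames: pure tones stand in for speech (hypothetical test signals).
SAMPLE_RATE = 16000  # assumed sampling rate in Hz
N = 256              # assumed frame length in samples
low_tone = [math.sin(2 * math.pi * 250 * t / SAMPLE_RATE) for t in range(N)]
high_tone = [math.sin(2 * math.pi * 6000 * t / SAMPLE_RATE) for t in range(N)]

low_feats = log_filterbank_energies(low_tone)
high_feats = log_filterbank_energies(high_tone)

# Index of the band holding the most energy for each toy signal.
print(low_feats.index(max(low_feats)), high_feats.index(max(high_feats)))
```

Each frame becomes a short vector of per-band log energies. A signal with little low-frequency energy, as the article says is typical of whispered speech, would show small values in the first few entries of that vector.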
In testing, they found the LSTM generally performed better than the MLP, conferring a number of advantages. As Raeesy explained, other components of Alexa’s speech recognition engine rely entirely on log filter-bank energies, and sourcing the same input data for different components makes the entire system more compact.
It wasn’t all smooth sailing, though — at least initially. Because Alexa recognizes the end of a command or reply by a short period of silence (a technique known as “end-pointing”), the LSTM’s confidence tended to fall off toward the tail end of utterances. To solve the problem, the researchers averaged the LSTM’s outputs for the entire utterance; in the end, dropping the last 1.25 seconds of speech data was “crucial” to maintaining performance.
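The effect of that fix can be sketched in a few lines of Python. This is an illustration only, not the researchers' code. The function name, the 10-millisecond frame spacing, the fallback for very short utterances, and the toy scores are assumptions. Only the averaging over the utterance and the 1.25-second trim come from the description above.

```python
def utterance_whisper_score(frame_scores, frame_hop_s=0.01, trim_s=1.25):
    """Average per-frame whisper scores, ignoring the trailing trim_s seconds.

    frame_scores: per-frame whisper confidences in [0, 1] (hypothetical
    classifier outputs). frame_hop_s is an assumed spacing between frames.
    """
    if not frame_scores:
        raise ValueError("frame_scores must not be empty")
    trim_frames = int(round(trim_s / frame_hop_s))
    if len(frame_scores) > trim_frames:
        kept = frame_scores[:len(frame_scores) - trim_frames]
    else:
        # Assumption: utterances shorter than the trim window keep all frames.
        kept = frame_scores
    return sum(kept) / len(kept)


# Toy utterance: confident whisper frames, then a low-confidence tail
# like the one end-pointing silence would produce.
scores = [0.9] * 200 + [0.2] * 125

trimmed = utterance_whisper_score(scores)
untrimmed = utterance_whisper_score(scores, trim_s=0.0)
print(round(trimmed, 3), round(untrimmed, 3))
```

With the tail included, the low-confidence frames drag the utterance-level average down. With the last 1.25 seconds dropped, the average reflects only the frames in which the speaker was actually talking.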
Whisper mode will be available in U.S. English in October.