Amazon scientist explains how Alexa resolves ambiguous requests

During a blockbuster press event last week, Amazon took the wraps off a redesigned Echo Show, Echo Plus, and Echo Spot, along with nine other new voice-activated accessories, peripherals, and smart speakers powered by Alexa. Also in tow: the Alexa Presentation Language, which lets developers build "multimodal" Alexa apps -- skills -- that combine voice, touch, text, images, graphics, audio, and video in a single interface.

Developing the frameworks that underlie it was easier said than done, according to Amazon senior speech scientist Vishal Naik. In a blog post today, he explained how Alexa leverages multiple neural networks -- layered math functions that loosely mimic the human brain's physiology -- to resolve ambiguous requests. The work is also detailed in a paper ("Context Aware Conversational Understanding for Intelligent Agents with a Screen") that was presented earlier this year at the Association for the Advancement of Artificial Intelligence.

"If a customer says, 'Alexa, play Harry Potter,' the Echo Show screen could display separate graphics representing a Harry Potter audiobook, a movie, and a soundtrack," he explained. "If the customer follows up by saying 'the last one,' the system must determine whether that means the last item in the on-screen list, the last Harry Potter movie, or something else."

Naik and colleagues evaluated three bidirectional long short-term memory (BiLSTM) neural networks -- a category of recurrent neural network that's capable of learning long-term dependencies -- with slightly different architectures. (Basically, the memory cells in LSTMs allow the neural networks to combine their memory and inputs to improve their prediction accuracy, and because they're bidirectional, they can access context from both past and future directions.)
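
To make the "both directions" idea concrete, here is a minimal sketch in plain Python. It is not the team's model: it swaps the LSTM cell for a single-unit tanh recurrent cell, and the weights `w_in` and `w_rec` are arbitrary illustrative values. What it does show is how each position ends up with one state computed from the words before it and one computed from the words after it.

```python
import math


def rnn_pass(inputs, w_in=0.5, w_rec=0.8):
    """Run a one-unit tanh recurrent cell over a sequence.

    Simplified stand-in for an LSTM cell (no gates, no separate cell state).
    Returns the hidden state after each input.
    """
    state, states = 0.0, []
    for x in inputs:
        state = math.tanh(w_in * x + w_rec * state)
        states.append(state)
    return states


def bidirectional_pass(inputs):
    """Pair each position's left-to-right state with its right-to-left state."""
    forward = rnn_pass(inputs)
    # Process the reversed sequence, then flip the results back so that
    # index i of `backward` summarizes inputs i..end.
    backward = rnn_pass(inputs[::-1])[::-1]
    return list(zip(forward, backward))


features = bidirectional_pass([1.0, 0.0, -1.0])
```

In this toy, the first element of `features[0]` depends only on the first input, while the second element depends on the whole sequence read from the end, which is the sense in which a bidirectional network "sees" future context.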

Sourcing data from the Alexa Meaning Representation Language, an annotated semantic-representation language released in June of this year, the team jointly trained the AI models to classify commands by both intent, which designates the action a customer wants Alexa to take, and slot, which designates the entities (e.g., an audiobook, movie, or smart home device trigger) the intent acts on. And they fed them embeddings, or mathematical representations of words.
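
As a rough illustration of the intent/slot split, the sketch below represents one annotated command in Python. The label names (`PlayIntent`, `Title`, and `O` for "no slot") and the `Annotation` class are hypothetical and are not taken from the Alexa Meaning Representation Language; they only show that an utterance carries one intent plus a slot tag per word.

```python
from dataclasses import dataclass


@dataclass
class Annotation:
    tokens: list  # the words of the command
    intent: str   # one action label for the whole command
    slots: list   # one slot tag per token; "O" means the token fills no slot


def extract_entities(annotation):
    """Merge consecutive tokens that share a slot tag into (tag, text) pairs."""
    entities, prev_tag = [], "O"
    for token, tag in zip(annotation.tokens, annotation.slots):
        if tag != "O":
            if tag == prev_tag:
                # Same tag as the previous token: extend the current entity.
                entities[-1] = (tag, entities[-1][1] + " " + token)
            else:
                entities.append((tag, token))
        prev_tag = tag
    return entities


# Hypothetical labels, for illustration only.
example = Annotation(
    tokens=["play", "harry", "potter"],
    intent="PlayIntent",
    slots=["O", "Title", "Title"],
)
entities = extract_entities(example)  # [("Title", "harry potter")]
```

A jointly trained model would predict both the `intent` field and the `slots` list from the same input, rather than using a separate model for each.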

The first of the three neural networks considered both the aforementioned embeddings and the type of content that would be displayed on Alexa devices with screens (in the form of a vector) in its classifications. The second went a step further, taking into account not just the type of on-screen data, but the specific name of the data type (e.g., "Harry Potter" or "The Black Panther" in addition to "Onscreen_Movie"). The third, meanwhile, used convolutional filters to identify each name's contribution toward the final classification's accuracy, and based its predictions on the most relevant of the bunch.
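
One plausible way to picture the first network's input is sketched below. It assumes the screen-type vector is a multi-hot vector appended to every word embedding; the article does not say how the two are combined, so treat that as an assumption. Only `Onscreen_Movie` appears in the source. The other type names, the function names, and the two-dimensional toy embeddings are made up for illustration.

```python
# Hypothetical inventory of on-screen content types; only "Onscreen_Movie"
# is named in the article.
SCREEN_TYPES = ["Onscreen_Movie", "Onscreen_Book", "Onscreen_Music"]


def screen_type_vector(types_on_screen):
    """Multi-hot vector: 1.0 for each content type currently displayed."""
    return [1.0 if t in types_on_screen else 0.0 for t in SCREEN_TYPES]


def with_screen_context(word_embeddings, types_on_screen):
    """Append the same screen-type vector to every word embedding."""
    context = screen_type_vector(types_on_screen)
    return [embedding + context for embedding in word_embeddings]


# Toy 2-dimensional embeddings for a two-word command.
toy_embeddings = [[0.1, 0.2], [0.3, 0.4]]
model_input = with_screen_context(toy_embeddings, {"Onscreen_Movie"})
```

The second and third networks described above would additionally need a representation of the displayed names themselves (such as "Harry Potter"), which this sketch leaves out.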

To evaluate the three networks' performance, the researchers established a benchmark that used hard-coded rules to factor in on-screen data. Given a command like "Play Harry Potter," it might estimate a 50 percent probability that the request refers to the audiobook and a 10 percent probability that it refers to the soundtrack.
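
The article does not spell out the benchmark's rules, so the following is only a guess at the general shape of such a system: start from voice-only probabilities, apply a fixed boost to interpretations that match what is on screen, and renormalize. The 50 percent and 10 percent figures come from the example above. The 40 percent movie figure, the `boost` factor, and the function name are assumptions.

```python
# Voice-only estimates for "Play Harry Potter". The audiobook and soundtrack
# values follow the article's example; the movie value is an assumption
# chosen so the three sum to 1.
PRIOR = {"audiobook": 0.5, "movie": 0.4, "soundtrack": 0.1}


def rescore(prior, on_screen, boost=3.0):
    """Hard-coded rule: multiply the score of any interpretation whose
    content type is on screen by `boost`, then renormalize to sum to 1."""
    weighted = {
        kind: p * (boost if kind in on_screen else 1.0)
        for kind, p in prior.items()
    }
    total = sum(weighted.values())
    return {kind: w / total for kind, w in weighted.items()}


# With only a soundtrack on screen, its share rises from 0.10 to about 0.25.
scores = rescore(PRIOR, on_screen={"soundtrack"})
```

Rules like this have to be written by hand for each new screen layout, which is the maintenance burden the learned models are meant to avoid.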

In the end, when evaluated with four different data sets (slots with and without screen information and intents with and without screen information), all three of the AI models that considered on-screen data "consistently outperform[ed]" both the benchmark and a voice-only test set. More importantly, they didn't exhibit degraded accuracy when trained exclusively on speech inputs.

"[We] verified that the contextual awareness of our models does not cause a degradation of non-contextual functionality," Naik and team wrote. "Our approach is naturally extensible to new visual use cases, without requiring manual rule writing."

In future research, they hope to explore additional context cues and extend visual features to encode screen object locations for multiple object types displayed on-screen (for example, books and movies).