Amazon scientists improve Alexa's ability to understand context

Pinpointing the person, place, or thing someone's referring to in a question has been a longstanding challenge for systems like those underpinning Alexa. One trick often used to solve this reference resolution problem is slot carryover, which employs syntactic context to narrow down the target noun. For instance, if someone asks "When is Lion King playing at the Bijou?" and follows up with the question "Is there a good Mexican restaurant near there?" "there" is presumed to be "Bijou."

Slot carryover generally works without a hitch in narrowly targeted applications, but things become more complicated for AI systems like Alexa, whose knowledge bases span hundreds or even thousands of domains. Different services use different slots for the same data, and over the course of a conversation, natural language understanding models must suss out which slots used by one service (for example, a movie-finding service that tags location data with the slot name "Theater_Location") should pass their data to slots used by another.
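To make the cross-service problem concrete, here is a minimal sketch of a hand-written carryover rule. Only the slot name "Theater_Location" comes from the example above; the service names, the "Landmark" slot, and the mapping table are hypothetical, and nothing here reflects how Alexa actually implements carryover.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slot:
    name: str
    value: str


# Hypothetical mapping: (source service, source slot) -> {target service: target slot}.
# A real system spanning thousands of services cannot maintain this by hand,
# which is the motivation for learning carryover instead.
SLOT_MAP = {
    ("MovieFinder", "Theater_Location"): {"RestaurantFinder": "Landmark"},
}


def carry_over(prev_service, prev_slots, current_service):
    """Return the slots from the previous turn, renamed for the current service."""
    carried = []
    for slot in prev_slots:
        target = SLOT_MAP.get((prev_service, slot.name), {}).get(current_service)
        if target is not None:
            carried.append(Slot(target, slot.value))
    return carried


previous_turn = [Slot("Movie_Title", "Lion King"), Slot("Theater_Location", "Bijou")]
carried = carry_over("MovieFinder", previous_turn, "RestaurantFinder")
# carried == [Slot(name="Landmark", value="Bijou")]
```

Every new pair of services needs another entry in a table like this one, which is why hand-written rules scale poorly.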

That's why research scientists at Amazon's Alexa AI division recently investigated novel models that learn to carry slots from previous turns of dialogue over to the current turn. By making joint judgments about slot values, rather than deciding on each one independently, the models managed to significantly outperform a rule-based baseline system in validation tests.

As research coauthors Chetan Naik and Pushpendre Rastogi explained in a post on Amazon's Alexa Blog this morning, slot values are correlated in many Alexa services, such that a strong likelihood of carrying over one slot value implies a strong likelihood of carrying over another. For example, U.S. services have slots for both city and state, and if one of those slots is carried over from one dialogue turn to another, it's very likely the other should be, too.
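A toy calculation shows why such correlations matter. The sketch below is not the researchers' model: the probabilities, the coupling bonus, and both function names are invented for illustration. It compares thresholding each slot on its own with scoring whole subsets of slots, where a bonus rewards treating correlated slots the same way.

```python
import math
from itertools import combinations


def independent_carryover(probs, threshold=0.5):
    """Decide each slot on its own: carry it if its probability clears the threshold."""
    return {slot for slot, p in probs.items() if p >= threshold}


def joint_carryover(probs, coupling):
    """Pick the best-scoring subset of slots by brute force.

    A subset's score is the sum of log-odds of its members, plus a bonus for
    every coupled pair that is treated alike (both carried or both dropped).
    """
    slots = sorted(probs)

    def score(subset):
        total = sum(math.log(probs[s] / (1 - probs[s])) for s in subset)
        for (a, b), bonus in coupling.items():
            if (a in subset) == (b in subset):
                total += bonus
        return total

    candidates = [
        frozenset(c) for r in range(len(slots) + 1) for c in combinations(slots, r)
    ]
    return set(max(candidates, key=score))


# Invented numbers: the model is fairly sure about "city", less sure about "state".
probs = {"city": 0.7, "state": 0.4}
coupling = {("city", "state"): 1.0}  # assumed strength of the city/state correlation

independent = independent_carryover(probs)  # {"city"}
joint = joint_carryover(probs, coupling)    # {"city", "state"}
```

Decided alone, the state slot falls below the threshold and is dropped. Decided jointly, the confident city slot pulls it along. Brute-force enumeration is only practical for a handful of slots, which is one reason to learn the selection instead.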

The team exploited these correlations in two machine learning systems, the first a pointer network based on the long short-term memory (LSTM) architecture. One bidirectional LSTM -- an encoder -- converted sequential input data into mathematical representations (vectors) by processing it forward and backward, while a second LSTM -- a decoder -- converted those representations back into a data sequence. In effect, the pointer network outputs the subset of slots to be carried over from previous rounds of dialogue.
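The "pointing" idea can be sketched without any neural network. In the toy loop below, each step scores every candidate slot against a query vector, selects the best one, and stops once a special stop vector wins. The vectors, the stop symbol, and the rule that the query becomes the last selected vector are all simplifying assumptions standing in for the learned LSTM encoder and decoder.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def pointer_decode(slot_vectors, stop_vector, query, max_steps=10):
    """Greedily point at input positions until the stop vector is chosen.

    Returns the indices of the selected slots, in selection order.
    """
    candidates = list(slot_vectors) + [stop_vector]
    stop_index = len(slot_vectors)
    selected = []
    for _ in range(max_steps):
        # Mask slots that were already selected so each is picked at most once.
        scores = [
            float("-inf") if i in selected else dot(query, c)
            for i, c in enumerate(candidates)
        ]
        best = max(range(len(candidates)), key=lambda i: scores[i])
        if best == stop_index:
            break
        selected.append(best)
        query = candidates[best]  # toy stand-in for the decoder's state update
    return selected


# Hand-picked 2-D vectors: index 0 = city, 1 = state, 2 = artist.
slot_vectors = [[1.0, 0.0], [0.9, 0.2], [0.0, 1.0]]
stop_vector = [0.5, 0.5]
chosen = pointer_decode(slot_vectors, stop_vector, query=[1.0, 0.0])
# chosen == [0, 1]: city and state are carried over, artist is not.
```

Because the output is a sequence of positions in the input rather than words from a fixed vocabulary, the same loop works however many candidate slots a dialogue contains.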

The other architecture the team considered tapped the same encoder as the first, but it replaced the pointer network's decoder with a self-attention decoder based on Google's Transformer architecture. The self-attention mechanism enabled it to learn which data to emphasize when deciding how to handle a given input, chiefly by comparing each input slot to all the other slots that had been identified in preceding turns of dialogue.
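The comparison of each slot with every other slot can be illustrated with plain scaled dot-product self-attention. This is a bare-bones sketch that uses the slot vectors directly as queries, keys, and values; a real Transformer layer adds learned projections, multiple heads, and more, and the example vectors are made up.

```python
import math


def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(vectors):
    """Return (weights, outputs) for scaled dot-product self-attention.

    weights[i][j] says how much slot i attends to slot j; outputs[i] is the
    attention-weighted average of all slot vectors.
    """
    d = len(vectors[0])
    weights = []
    for q in vectors:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in vectors]
        weights.append(softmax(scores))
    outputs = [
        [sum(w * v[j] for w, v in zip(row, vectors)) for j in range(d)]
        for row in weights
    ]
    return weights, outputs


# Made-up vectors: the first two slots are similar, the third is unrelated.
slots = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
weights, outputs = self_attention(slots)
```

In this example the first slot puts more weight on the second, similar slot than on the third, which is the kind of signal a decoder could use to carry related slots over together.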

The researchers tested their systems using two corpora (one a standard public benchmark and the other an internal data set), and they found that the new architectures modeled slot interdependencies better than systems Alexa AI scientists detailed last year. The Transformer system performed slightly better than the pointer network overall, and it had a particular edge in recognizing slot correlations across longer spans of dialogue.

The work won one of two best-paper awards at the Association for Computational Linguistics' Workshop on Natural-Language Processing for Conversational AI, but Naik, Rastogi, and colleagues aren't resting on their laurels. In fact, they're already planning to improve slot carryover further with larger data sets and techniques like transfer learning.
