
Amazon’s Alexa is becoming more proficient at understanding multistep requests in one shot. In a paper (“Parsing Coordination for Spoken Language Understanding”) and accompanying blog post published this morning, Sanchit Agarwal, an applied scientist in the Alexa AI organization, detailed a spoken-language understanding (SLU) system that maps voice commands to actions (intents) and entities (slots) 26 percent more accurately than off-the-shelf alternatives.

Agarwal and colleagues’ work will be presented at the upcoming IEEE Spoken Language Technology (SLT) workshop in Athens, Greece, later this month. News of their research comes a day after Amazon scientists described an AI-driven method that can cut Alexa’s skill selection error rate by 40 percent.

“Narrow [SLU systems] usually have rigid constraints, such as allowing only one intent to be associated with an utterance and only one value to be associated with a slot type,” he wrote. “We [propose] a way to enable SLU systems to understand compound entities and intents.”

As Agarwal explained, he and colleagues used a deep neural network — layers of mathematical functions called neurons, loosely modeled on their biological equivalents — that learned from structures in spoken-language data. First, a corpus was labeled according to a scheme indicating groups of words, or “chunks,” that should be treated as units: “B” marks the beginning of a chunk, “I” a word inside a chunk, and “O” a word that lies outside any chunk. Then, prior to training, each word was converted into an embedding, a vector that represents the word numerically.
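To make the labeling scheme concrete, here is a toy example applied to one of the utterances quoted later in the article. The tokenization and tag assignments are our own illustration of BIO chunking, not Amazon's actual annotation:

```python
# BIO chunk labels for a list-building utterance: each "B" opens a chunk
# (a slot value), any following "I" tags extend it, and "O" marks words
# outside every chunk.
tokens = ["add", "apples", "peanut", "butter", "and", "jelly", "to", "my", "list"]
tags   = ["O",   "B",      "B",      "I",      "O",   "B",     "O",  "O",  "O"]

for token, tag in zip(tokens, tags):
    print(f"{token:>7}  {tag}")
```

Under this labeling, "peanut butter" forms a single two-word chunk (B followed by I), while "apples" and "jelly" are one-word chunks.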

The embeddings were next passed to a bidirectional long short-term memory (bi-LSTM) network, a type of recurrent neural network capable of learning long-range dependencies, which output a contextual embedding of each word in the input sentence. Those outputs were then fed to a neural network layer that mapped each contextual embedding to a distribution over the “B,” “I,” and “O” labels, classifying each word of the input according to its most probable label.
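The data flow can be sketched in NumPy with toy shapes. This is not Amazon's model: the two directional recurrences are faked here with cumulative sums purely to show how forward and backward context vectors are concatenated per token and then mapped to a label distribution; all weights are random.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

n_tokens, emb_dim, hidden = 9, 16, 8
embeddings = rng.normal(size=(n_tokens, emb_dim))  # one vector per word

# Stand-in for the bi-LSTM: left-to-right and right-to-left context,
# here approximated by running sums instead of gated recurrences.
W_f = rng.normal(size=(emb_dim, hidden))
W_b = rng.normal(size=(emb_dim, hidden))
fwd_ctx = np.cumsum(embeddings @ W_f, axis=0)              # left-to-right
bwd_ctx = np.cumsum((embeddings @ W_b)[::-1], axis=0)[::-1]  # right-to-left
contextual = np.concatenate([fwd_ctx, bwd_ctx], axis=1)    # (n_tokens, 2*hidden)

# Output layer: each contextual embedding -> distribution over B/I/O.
labels = ["B", "I", "O"]
W_out = rng.normal(size=(2 * hidden, len(labels)))
probs = softmax(contextual @ W_out)
pred = [labels[i] for i in probs.argmax(axis=1)]
```

The key point is the shape contract: each of the `n_tokens` positions ends up with its own three-way probability distribution, so every word gets an independent (context-aware) label prediction.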

An additional layer, known as a conditional random field (CRF), learned dependencies between adjacent output labels and chose the most likely sequence from all possible label sequences. Thanks to a technique called adversarial training — during which the network was evaluated on how well or poorly it predicted the labels — the model learned to generalize.
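The standard way a CRF layer picks the best label sequence at inference time is Viterbi decoding over per-token emission scores plus learned label-to-label transition scores. The minimal sketch below (our own illustration, with hand-set scores rather than learned ones) shows why this matters: a hard transition penalty can veto a greedy per-token choice, such as an "I" directly following an "O", which would be a chunk interior with no chunk to be inside of.

```python
import numpy as np

def viterbi(emissions, transitions):
    """Most likely label sequence given per-token emission scores and
    label-to-label transition scores (both in the log domain)."""
    n, k = emissions.shape
    score = emissions[0].copy()
    back = np.zeros((n, k), dtype=int)
    for t in range(1, n):
        # cand[from, to] = best score ending in `from`, then moving to `to`
        cand = score[:, None] + transitions + emissions[t][None, :]
        back[t] = cand.argmax(axis=0)
        score = cand.max(axis=0)
    path = [int(score.argmax())]
    for t in range(n - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return path[::-1]

labels = ["B", "I", "O"]
trans = np.zeros((3, 3))
trans[labels.index("O"), labels.index("I")] = -1e9  # forbid O -> I

# Emission scores chosen so the greedy per-token choice is the illegal
# sequence O -> I; Viterbi routes around the forbidden transition.
emit = np.array([
    [0.0, 0.0, 1.0],   # token 1: greedy pick is "O"
    [0.5, 1.0, 0.0],   # token 2: greedy pick is "I"
])
greedy = list(emit.argmax(axis=1))  # [2, 1], i.e. O then I (illegal)
best = viterbi(emit, trans)         # [2, 0], i.e. O then B (legal)
print([labels[i] for i in best])
```

The transition matrix is what the CRF layer learns during training; here one entry is pinned to a large negative value to make the constraint explicit.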

“Instead of building separate parsers for different slot types (such as ListItem, FoodItem, Appliance, etc.), we built one parser that can handle multiple slot types,” Agarwal said. “For example, our parser can successfully identify [list items] in the utterance ‘add apples peanut butter and jelly to my list’ and [appliances] in the utterance ‘turn on the living room light and kitchen light’.”
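Once the tagger has produced BIO labels, turning them back into slot values is a simple grouping pass. A minimal sketch, using our own illustrative tags for the list utterance Agarwal quotes:

```python
def extract_chunks(tokens, tags):
    """Group tokens into slot values using BIO tags: each "B" starts a
    new chunk and each following "I" extends it; "O" closes any open chunk."""
    chunks, current = [], []
    for token, tag in zip(tokens, tags):
        if tag == "B":
            if current:
                chunks.append(" ".join(current))
            current = [token]
        elif tag == "I" and current:
            current.append(token)
        else:
            if current:
                chunks.append(" ".join(current))
            current = []
    if current:
        chunks.append(" ".join(current))
    return chunks

tokens = "add apples peanut butter and jelly to my list".split()
tags   = ["O", "B", "B", "I", "O", "B", "O", "O", "O"]
print(extract_chunks(tokens, tags))  # ['apples', 'peanut butter', 'jelly']
```

Note that the same function works unchanged for the appliance utterance, which is the point of the single shared parser: only the tags differ per slot type, not the extraction logic.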

