Amazon improves Alexa's ability to recognize conversation topics by 35%

As any HomePod, Google Home, or Echo owner can tell you, getting a smart speaker to understand you -- much less suss out the topic of a conversation -- is usually a crapshoot. But encouragingly, researchers at Amazon are progressing toward more responsive, contextually aware voice experiences, in part thanks to "topic modeling" -- i.e., identifying topics to help route requests more accurately.

In new research, they developed a prototype system that can boost Alexa's topic recognition accuracy by up to 35 percent. It's described in a paper that'll be presented at the IEEE Spoken Language Technology (SLT) workshop in Athens, Greece in late December.

"[Our] system uses two additional sources of information to determine the topic of a given utterance: the utterances that immediately preceded it and its classification as a 'dialogue act'," Behnam Hedayatnia, an applied speech scientist at Amazon, wrote in a blog post.

To validate the AI system, the researchers used more than 100,000 annotated voice requests collected during the 2017 Alexa Prize competition, which tasked 15 teams with deploying Alexa chatbot systems. Annotators labeled each request with one of 14 dialogue acts and one of 12 topic labels -- such as Politics, Entertainment/Movies, Fashion, and Entertainment/Books -- and noted the keywords in commands that helped them identify topics. (For instance, "brand" and "Italy" in "Gucci is a famous brand from Italy.")

The topic-modeling system comprises three AI architectures: (1) a deep averaging network (DAN), (2) an attention-based variation on the DAN (ADAN) that learns to predict the keywords annotators flagged as indicating a topic, and (3) a bidirectional long short-term memory (LSTM) network. Bidirectional LSTMs are a category of recurrent neural network capable of learning long-term dependencies; they read a sequence both forward and backward, combining memory of earlier inputs with the current input to improve prediction accuracy.

Inputs to all three networks consist of a voice command, a dialogue act classification, and a conversational context -- in other words, the last five turns in a conversation, where a turn is a combination of a speaker's request and a chatbot's response.

The DAN produces embeddings -- mathematical representations -- of words, and subsequently of sentences, by averaging the word embeddings. Those sentence embeddings are averaged together again to produce a single summary embedding, which is appended to the embedding of the current voice command and passed to a neural network that learns to correlate embeddings with topic classifications.
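The averaging steps described above can be sketched in a few lines of Python. This is a minimal illustration, not the researchers' implementation: the random vectors stand in for trained word embeddings, the dialogue act input is left out, and the names `EMBED_DIM`, `embed_word`, `embed_sentence`, and `dan_features` are hypothetical.

```python
import numpy as np

EMBED_DIM = 8  # hypothetical embedding size, chosen only for illustration
_rng = np.random.default_rng(0)
_vocab = {}  # word -> vector; random stand-in for trained embeddings


def embed_word(word):
    """Return a fixed vector for a word, creating a random one on first use."""
    if word not in _vocab:
        _vocab[word] = _rng.normal(size=EMBED_DIM)
    return _vocab[word]


def embed_sentence(sentence):
    """Sentence embedding = average of its word embeddings."""
    words = sentence.lower().split()
    return np.mean([embed_word(w) for w in words], axis=0)


def dan_features(current, context):
    """Average the context sentence embeddings into one summary vector,
    then append it to the embedding of the current command."""
    if context:
        summary = np.mean([embed_sentence(s) for s in context], axis=0)
    else:
        summary = np.zeros(EMBED_DIM)  # assumption: empty context -> zeros
    return np.concatenate([embed_sentence(current), summary])


features = dan_features("who directed it", ["i like movies", "me too"])
print(features.shape)
```

The resulting vector is what would be handed to a classifier network; training that network is outside the scope of this sketch.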

The ADAN, the keyword-predicting variation on the DAN, meanwhile builds a matrix that maps every word it encounters against each of the 12 topics it's asked to recognize and records how often annotators correlated a particular word with a particular topic. Like the DAN, it also embeds the words of the current voice command and of past commands, averages the word embeddings within each command, and then averages those averages.

In the end, each word has 12 numbers associated with it -- a 12-dimensional vector -- indicating its relevance to each topic. The vectors associated with words from the current voice command are combined with vectors from past commands and passed to the neural network for classification.
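The word-by-topic table can be illustrated with plain counting. The sketch below is an assumption-heavy simplification: it uses only the four topic names mentioned in this article rather than all 12, it counts keyword annotations directly instead of learning weights, and both the annotation examples (including the pairing of "brand" and "Italy" with Fashion) and the function names are hypothetical.

```python
from collections import defaultdict

import numpy as np

# Only the four topics named in the article; the real system uses 12.
TOPICS = ["Politics", "Entertainment/Movies", "Fashion", "Entertainment/Books"]
TOPIC_INDEX = {topic: i for i, topic in enumerate(TOPICS)}


def build_word_topic_table(annotations):
    """Count how often annotators tied each keyword to each topic.

    `annotations` is a list of (keywords, topic) pairs.
    """
    table = defaultdict(lambda: np.zeros(len(TOPICS)))
    for keywords, topic in annotations:
        for word in keywords:
            table[word.lower()][TOPIC_INDEX[topic]] += 1
    return dict(table)


def utterance_topic_vector(utterance, table):
    """Average the per-word topic vectors; unseen words contribute zeros."""
    words = utterance.lower().replace(".", "").split()
    zeros = np.zeros(len(TOPICS))
    if not words:
        return zeros
    return np.mean([table.get(w, zeros) for w in words], axis=0)


# Hypothetical annotations, for illustration only.
table = build_word_topic_table([
    (["brand", "Italy"], "Fashion"),
    (["election"], "Politics"),
])
vec = utterance_topic_vector("Gucci is a famous brand from Italy.", table)
print(TOPICS[int(np.argmax(vec))])
```

Each word ends up with one number per topic, and averaging those vectors across an utterance gives a rough topic profile that a classifier could consume alongside the same profile for past commands.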

In testing, four versions of the system improved topic classification accuracy over the baseline. One configuration achieved accuracy of 74 percent, up from 55 percent for the baseline -- a gain of 19 percentage points, or roughly 35 percent in relative terms.
