Humans have an easy time with context carryover — the ability to track references through rounds of conversation (like inferring the meaning of “there” in a follow-up to “How is the weather in San Francisco?”) — but it’s beyond the conversational repertoire of most smart speakers and virtual assistants. That’s nothing a little artificial intelligence (AI) can’t fix, though, and in a blog post today researchers at Amazon wrote about the progress they’ve made with Alexa.
During a typical chat with Alexa, users invoke multiple apps — skills, in Amazon’s parlance — in successive questions. Currently, Alexa analyzes the content of each query according to its domain (type of skill), intent (function of the skill), and slot (a variable the skill acts upon). But skills often repurpose slots like “town” and “city,” which poses an obvious problem — if a user asks for directions and follows up with a question about a restaurant’s location, how’s Alexa to know which thread to reference in its answer?
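To make the domain/intent/slot breakdown concrete, here is a minimal sketch of how two successive queries might be represented. The class, the skill names, and the slot names ("WeatherCity", "Town") are hypothetical illustrations, not Amazon's actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class Interpretation:
    """Hypothetical container for one parsed query."""
    domain: str                                  # type of skill
    intent: str                                  # function of the skill
    slots: dict = field(default_factory=dict)    # slot name -> value (None if unfilled)

def unresolved_slots(interp):
    """Return the names of slots the query left unfilled."""
    return [name for name, value in interp.slots.items() if value is None]

# "How is the weather in San Francisco?"
weather = Interpretation("Weather", "GetWeather", {"WeatherCity": "San Francisco"})

# "Any good restaurants there?" The restaurant skill calls its location slot
# "Town", so nothing in its own schema says "there" means San Francisco.
followup = Interpretation("LocalSearch", "FindRestaurant", {"Town": None})

print(unresolved_slots(followup))
```

The unfilled "Town" slot in the follow-up is the gap that slot carryover has to close.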
Gupta and colleagues describe a solution in the paper “Contextual Slot Carryover for Disparate Schemas”: a neural network that automatically learns to map one skill’s slots to another’s. The findings will be presented at the upcoming Interspeech conference in Hyderabad, India, in September.
The system consists of two parts, Gupta explained: (1) an encoder that produces summary vectors from conversation data, and (2) a decoder that calculates a confidence score for each slot. The two together “learn” slot names and all of their possible values to create embeddings — i.e., points in geometric space that represent strings of words — and then leverage a long short-term memory (LSTM) encoder (a neural network that accounts for the sequencing of data) to identify words helpful in determining candidate slot mapping.
During the encoding stage, the “utterance histories” of both the user and Alexa are combined into a single vector, which is passed along to the decoder. With the help of a mechanism that guides it in focusing on either the user’s or Alexa’s utterances, the decoder chooses whether or not to carry a slot over.
“When the system is in use, we use proximity in embedding space to generate a list of candidate mappings between every slot encountered in the conversation so far and the slots available in the currently invoked skill,” Gupta wrote. “Each of these candidates is then fed into the encoder, along with other features, such as the recent history of the customer’s utterances, the recent history of Alexa’s responses, and the inferred intent of the customer’s most recent utterance.”
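The candidate-generation step Gupta describes can be sketched roughly as follows. This is only an illustration under stated assumptions: the three-dimensional embeddings are made-up toy values, the slot names are hypothetical, and cosine similarity with a fixed threshold stands in for whatever proximity measure the actual system uses. In the real system, each surviving candidate would then be scored by the encoder and decoder rather than accepted outright.

```python
import math

# Toy, hand-picked embeddings (assumption): in the described system these
# would be learned from slot names and their possible values.
SLOT_EMBEDDINGS = {
    "WeatherCity": [0.9, 0.1, 0.0],
    "Town": [0.8, 0.2, 0.1],
    "Date": [0.0, 0.1, 0.9],
    "RestaurantName": [0.1, 0.9, 0.1],
}

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def candidate_mappings(context_slots, target_slots, threshold=0.8):
    """Pair every slot seen so far with every slot of the current skill,
    keeping pairs whose embeddings are close enough to be candidates."""
    candidates = []
    for src in context_slots:
        for dst in target_slots:
            score = cosine(SLOT_EMBEDDINGS[src], SLOT_EMBEDDINGS[dst])
            if score >= threshold:
                candidates.append((src, dst, score))
    # Closest pairs first
    return sorted(candidates, key=lambda c: c[2], reverse=True)

# Slots from the conversation so far vs. slots of the newly invoked skill
result = candidate_mappings(["WeatherCity", "Date"], ["Town", "RestaurantName"])
for src, dst, score in result:
    print(f"{src} -> {dst} (similarity {score:.2f})")
```

With these toy vectors, only the "WeatherCity" to "Town" pair clears the threshold, which is the kind of cross-skill mapping a hardcoded rule table would otherwise have to list by hand.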
Gupta and coauthors found that, compared to a system of hardcoded slot maps, the neural network offered “slight” improvements in performance and had “significantly [higher] recall.”
“Overall, according to the F1 score, which combines recall and precision (a measure of the false-positive rate), our system outperformed the rule-based system by roughly 9 percent,” Gupta wrote.
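For readers unfamiliar with the metric, the F1 score is the harmonic mean of precision and recall. The sketch below uses the standard definitions; the counts in the example are invented for illustration and are not figures from the paper.

```python
def precision_recall_f1(true_pos, false_pos, false_neg):
    """Compute precision, recall, and F1 from raw counts.

    precision = TP / (TP + FP): share of carried-over slots that were correct
    recall    = TP / (TP + FN): share of slots that should have been carried
                                over and actually were
    F1        = harmonic mean of the two
    """
    predicted = true_pos + false_pos
    actual = true_pos + false_neg
    precision = true_pos / predicted if predicted else 0.0
    recall = true_pos / actual if actual else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# Invented counts: 8 correct carryovers, 2 wrong ones, 8 missed
precision, recall, f1 = precision_recall_f1(8, 2, 8)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.3f}")
```

Because the harmonic mean is pulled toward the lower of the two values, a system with much higher recall can post a clearly better F1 even when its precision gain is small.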
Context carryover for Alexa, which was announced in April, is “being phased [in],” Gupta said.
It’s worth noting that Google unveiled a similar feature for the Google Assistant during its I/O developer conference in May: Continued Conversation. It came to Google Home speakers in June and allows users to ask follow-up questions without needing to use multiple wake phrases.