Amazon researchers cut Alexa skill selection error rate by 40%

Researchers at Amazon have managed to improve Alexa's ability to choose third-party apps, or skills, by using a novel data representation technique. In a blog post and accompanying paper ("Coupled Representation Learning for Domains, Intents and Slots in Spoken Language Understanding"), Young-Bum Kim, an Amazon science leader in the Seattle company's Alexa AI division, and team describe a scheme devised for natural language tasks that can cut Alexa's skill selection error rate by 40 percent.

Their work will be presented at the IEEE Spoken Language Technology Workshop in Athens, Greece later this month, and comes on the heels of research last week that was shown to improve Alexa's speech recognition by up to 15 percent.

"In recent years, data representation has emerged as an important research topic within machine learning," Kim wrote. "Natural language understanding (NLU) systems, for instance, rarely take raw text as inputs. Rather, they take embeddings, data representations that preserve semantic information about the text but present it in a consistent, formalized way. Using embeddings rather than raw text has been shown time and again to improve performance on particular NLU tasks."

The new representation method takes advantage of the way Alexa handles requests. As Kim explains, Alexa categorizes requests first by their subject area, or domain (for example, music or weather), and next by intent, or the intended action. Finally, they're classified according to slot type -- the entity defining how Alexa recognizes and handles data. (A skill that uses the actor slot type might query filmographies with the names of supplied actors and actresses, for example.)
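To make the three levels concrete, here is a minimal Python sketch of how one classified request might be represented. The class name, field names, and example labels ("Movies", "GetFilmography", "Actor") are hypothetical illustrations, not Alexa's actual schema.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class ParsedRequest:
    """Hypothetical container for the three classification levels."""
    domain: str                                           # subject area, e.g. music or weather
    intent: str                                           # the action the user wants performed
    slots: Dict[str, str] = field(default_factory=dict)   # slot type -> value found in the request


# Illustrative labels only; the real domain, intent, and slot inventories are Amazon's.
request = ParsedRequest(
    domain="Movies",
    intent="GetFilmography",
    slots={"Actor": "Tom Hanks"},
)

print(request.domain, request.intent, request.slots)
```

Each level narrows the one above it: an intent only makes sense within its domain, and a slot type only within the intents that accept it.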

Kim and coauthors leveraged the natural hierarchy of classifications to build an AI model that produces slot representations, intent representations, and domain representations. It's a multistep process. First, utterances are passed through what the researchers call a "de-lexicalizer," which substitutes generic slot names for slot-values. (A command like "play 'Nice for What' by Drake" becomes "play SongName by SongArtist.") The de-lexicalized utterances then move on to an embedding layer that converts their words into vectors -- mathematical representations -- such that words with similar meanings are clustered together.
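The de-lexicalization step can be sketched in a few lines of Python. This is an illustration under a simplifying assumption, not the researchers' implementation: it takes the slot values as already identified and does plain string substitution. The function name `delexicalize` and the dictionary layout are hypothetical.

```python
from typing import Dict


def delexicalize(utterance: str, slot_values: Dict[str, str]) -> str:
    """Replace each known slot value in the utterance with its generic slot name.

    slot_values maps a slot value (as it appears in the text) to its slot name.
    """
    # Substitute longer values first so a short value cannot overwrite
    # part of a longer one that happens to contain it.
    for value in sorted(slot_values, key=len, reverse=True):
        utterance = utterance.replace(value, slot_values[value])
    return utterance


# The example from the article; the slot values are assumed to be given.
command = "play 'Nice for What' by Drake"
slots = {"'Nice for What'": "SongName", "Drake": "SongArtist"}

print(delexicalize(command, slots))  # play SongName by SongArtist
```

The effect is that every "play X by Y" request collapses to the same pattern, whatever song and artist are named.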

The embeddings are then passed to a bidirectional long short-term memory (LSTM) network, a type of recurrent neural network capable of learning long-term dependencies. As Kim notes, it's an architecture widely used in NLP because of its knack for learning to "interpret words in light of their positions in a sentence."

All told, the researchers trained the AI system on 246,000 utterances covering 17 domains.

To test its precision, they used its encodings as inputs to a two-stage skill selection system. In experiments, it not only boosted accuracy from 90 percent to 94 percent, according to Kim, but managed to outperform three similar systems of their own design.
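The four-point accuracy gain and the 40 percent figure in the headline describe the same result: going from 90 to 94 percent accuracy means the error rate falls from 10 percent to 6 percent, and that drop is 40 percent of the original error. A short Python check of the arithmetic, using only the numbers reported above (the variable names are illustrative):

```python
baseline_accuracy = 0.90  # accuracy before the new representations
improved_accuracy = 0.94  # accuracy with the new representations

baseline_error = 1 - baseline_accuracy  # 10 percent of requests misrouted
improved_error = 1 - improved_accuracy  # 6 percent of requests misrouted

# Relative reduction: the share of the original errors that were eliminated.
relative_reduction = (baseline_error - improved_error) / baseline_error

print(f"Error rate: {baseline_error:.0%} -> {improved_error:.0%}")
print(f"Relative reduction: {relative_reduction:.0%}")
```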

"We test our scheme on the vital task of skill selection, or determining which Alexa skill among thousands should handle a given customer request," Kim wrote. "We find that our scheme cuts the skill selection error rate [substantially], which should help make customer interactions with Alexa more natural and satisfying."
