Amazon details AI that guesses which Alexa skill to launch from vague commands

Alexa recently gained what Amazon calls "name-free skill interaction," which enables it to parse intent from requests that don't explicitly name third-party voice apps. (For instance, "Alexa, get me a car" might launch Uber, Lyft, or some other ride-hailing service.) But as scientists at the Seattle company's Alexa AI research division note, the task is more challenging than it seems on the surface -- in principle, the AI system that maps utterances to skills (dubbed "Shortlister") would need to be retrained from scratch each time new skills are added to the Alexa Skills Store.

Fortunately, they managed to devise a labor-saving alternative described in a new paper ("Continuous Learning for Large-scale Personalized Domain Classification") scheduled to be presented at the North American Chapter of the Association for Computational Linguistics in New Orleans. It entails "freezing" the settings of the AI model, adding new components that accommodate new skills, and training these new components only on data pertaining to those skills.

An Amazon spokesperson told VentureBeat it's being implemented in production "on a limited basis" -- i.e., not for all of the roughly 90,000 available Alexa skills just yet.

The researchers' approach relies on embeddings, which represent data as vectors (sequences of coordinates) of a fixed size that define points in a multidimensional space, where items with similar properties are grouped near each other. For the sake of efficiency, embeddings are stored in a large lookup table and loaded at run time.
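To make the idea concrete, here is a minimal Python sketch of an embedding lookup table. The skill names, the three-dimensional vectors, and the choice of cosine similarity as the measure of closeness are all assumptions made for illustration; none of them come from Amazon's system.

```python
import math

# Hypothetical lookup table: each skill name maps to a fixed-size vector.
# The names and values are invented for illustration only.
EMBEDDINGS = {
    "ride_hailing_a": [0.9, 0.1, 0.0],
    "ride_hailing_b": [0.8, 0.2, 0.1],
    "recipe_finder": [0.0, 0.9, 0.4],
}

def cosine_similarity(u, v):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nearest(name):
    """Look up `name` and return the most similar other entry in the table."""
    query = EMBEDDINGS[name]
    others = (k for k in EMBEDDINGS if k != name)
    return max(others, key=lambda k: cosine_similarity(query, EMBEDDINGS[k]))
```

In this toy table the two ride-hailing entries sit close together, so looking up one returns the other as its nearest neighbor.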

Machine learning models like Shortlister comprise layers of interconnected functions called nodes or neurons, which are loosely modeled after brain cells. The connections among them have weights indicating their relative importance (and by extension, the strength of the influence of their outputs on the next neuron's computation), which are iteratively modified during training.
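As a rough illustration of what a weighted connection and an iterative weight update look like, the Python sketch below models a single neuron. The sigmoid activation, squared-error objective, and learning rate are assumptions chosen for simplicity; the article does not say what Shortlister uses.

```python
import math

def neuron(inputs, weights, bias):
    """Weighted sum of inputs passed through a sigmoid activation."""
    total = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1.0 / (1.0 + math.exp(-total))

def train_step(inputs, weights, bias, target, lr=0.5):
    """One gradient-descent update for squared error on a single example."""
    out = neuron(inputs, weights, bias)
    # Gradient of 0.5 * (out - target)^2 with respect to the weighted sum.
    delta = (out - target) * out * (1.0 - out)
    new_weights = [w - lr * delta * x for w, x in zip(weights, inputs)]
    new_bias = bias - lr * delta
    return new_weights, new_bias
```

Calling `train_step` repeatedly nudges the weights so that the neuron's output moves toward the target, which is the "iteratively modified" part of training.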

Shortlister consists of three modules:

One that produces a vector representing an Alexa user's command.
A second that uses embeddings to represent all skills a user has enabled (about 10, on average) and that produces a single summary vector of enabled skills.
A third that maps inputs (customer utterances, combined with enabled-skill information) and outputs (skill assignments) to the same vector space and finds the output vector that best approximates the input vector.
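The three modules can be sketched in a few lines of Python. This is a toy stand-in, not Amazon's architecture: the vocabulary, the skill names, the two-dimensional vectors, the use of simple averaging in the first two modules, and the `enabled_weight` parameter are all hypothetical.

```python
# Hypothetical 2-dimensional vectors, invented for illustration.
WORD_VECTORS = {"get": [0.1, 0.1], "me": [0.0, 0.0], "a": [0.0, 0.0],
                "car": [1.0, 0.0], "recipe": [0.0, 1.0]}
SKILL_VECTORS = {"ride_skill": [1.0, 0.0], "taxi_skill": [0.9, 0.1],
                 "cooking_skill": [0.0, 1.0]}

def mean(vectors, dim=2):
    """Element-wise average; returns a zero vector for an empty list."""
    if not vectors:
        return [0.0] * dim
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def encode_utterance(text):
    """Module 1 (stand-in): average the vectors of known words."""
    words = text.lower().split()
    return mean([WORD_VECTORS[w] for w in words if w in WORD_VECTORS])

def summarize_enabled(enabled):
    """Module 2 (stand-in): one summary vector for the user's enabled skills."""
    return mean([SKILL_VECTORS[s] for s in enabled])

def shortlist(text, enabled, k=2, enabled_weight=0.5):
    """Module 3 (stand-in): rank skills by dot product with the combined input."""
    utterance = encode_utterance(text)
    summary = summarize_enabled(enabled)
    combined = [u + enabled_weight * e for u, e in zip(utterance, summary)]

    def score(skill):
        return sum(c * s for c, s in zip(combined, SKILL_VECTORS[skill]))

    return sorted(SKILL_VECTORS, key=score, reverse=True)[:k]
```

For example, `shortlist("get me a car", ["taxi_skill"])` ranks the two ride-related skills ahead of the cooking skill, because the combined input vector points in their direction.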

A second network -- HypRank, short for hypothesis ranker -- refines Shortlister's list of candidate skills using fine-grained contextual information.

When a new skill is added to Shortlister, the embedding table is extended with a corresponding row, and an output node is added for the skill. (A single row corresponds to a single skill, and each added node is connected to all nodes in the layer beneath it.) Next, the weights of all the network's connections (except those of the new output node) are frozen, and the new embedding and node are trained only on data associated with the skill.
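The freeze-and-extend step can be illustrated with a small Python sketch. It assumes a plain softmax output layer over a fixed feature vector and trains only the new output row; the function names, learning rate, and epoch count are hypothetical, and in the system described the new embedding row would be trained alongside the new node.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a dict of skill -> logit."""
    top = max(logits.values())
    exps = {s: math.exp(v - top) for s, v in logits.items()}
    total = sum(exps.values())
    return {s: e / total for s, e in exps.items()}

def predict(output_weights, features):
    """Return skill probabilities for one feature vector."""
    logits = {s: sum(w * x for w, x in zip(ws, features))
              for s, ws in output_weights.items()}
    return softmax(logits)

def add_skill(output_weights, new_skill, examples, lr=0.5, epochs=50):
    """Add one output row for `new_skill` and train only that row.

    Existing rows are never written to, which is the "frozen" part.
    `examples` are feature vectors that all belong to the new skill.
    """
    dim = len(next(iter(output_weights.values())))
    output_weights[new_skill] = [0.0] * dim
    for _ in range(epochs):
        for x in examples:
            p_new = predict(output_weights, x)[new_skill]
            row = output_weights[new_skill]
            # Cross-entropy gradient with respect to the new row only.
            output_weights[new_skill] = [w - lr * (p_new - 1.0) * xi
                                         for w, xi in zip(row, x)]
    return output_weights
```

Because only the new row is updated, the cost of adding a skill depends on the size of that skill's data rather than on the size of the full training set.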

In part to prevent "catastrophic forgetting," or the tendency of a network to abruptly forget previously learned information upon learning new information, Shortlister evaluates new skills' embeddings not just on how well the network as a whole classifies the new data, but on how consistent they are with existing embeddings. Additionally, during retraining it ingests small samples of data from each existing skill, with the samples chosen for their representativeness.
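Both safeguards can be sketched in Python under some loudly stated assumptions. The article does not say how "consistency" or "representativeness" is measured, so the sketch below assumes, purely for illustration, that consistency is a squared-distance penalty toward the centroid of existing embeddings and that representative samples are those closest to their skill's centroid. All function and parameter names are hypothetical.

```python
def centroid(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def representative_samples(examples, k):
    """Assumed selection rule: the k examples closest to their centroid."""
    center = centroid(examples)
    return sorted(examples, key=lambda x: squared_distance(x, center))[:k]

def retraining_set(new_skill, new_examples, existing_by_skill, k=1):
    """New-skill data plus k replayed examples per existing skill, labeled."""
    data = [(new_skill, x) for x in new_examples]
    for skill, examples in existing_by_skill.items():
        data += [(skill, x) for x in representative_samples(examples, k)]
    return data

def regularized_loss(classification_loss, new_embedding,
                     existing_embeddings, strength=0.1):
    """Assumed penalty: squared distance from the existing embeddings' centroid."""
    penalty = squared_distance(new_embedding, centroid(existing_embeddings))
    return classification_loss + strength * penalty
```

The replayed samples remind the network of existing skills while it learns the new one, and the penalty term discourages the new embedding from drifting far from the region the existing embeddings occupy.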

In experiments involving a training data set of 900 skills and a retraining data set of 100 new skills, the best-performing version of Shortlister (of six versions total) achieved 88% accuracy on existing skills, the researchers report, only 3.6% lower than that of the model retrained from scratch.
