The key to a more human-like Amazon Alexa is unsupervised learning

Alexa, Amazon's intelligent assistant that's in well over 100 million devices and works with over 60,000 appliances from 7,400 brands, gains new skills from contributions by the thousands of employees tinkering away at its backend systems. But there's a limit to what they can accomplish, owing to the way machine learning algorithms -- the statistical models underpinning Alexa's decision-making -- improve.

That's why scientists at Amazon's Alexa AI research division are pursuing semi-supervised and unsupervised techniques, in which AI systems learn to make predictions without ingesting gobs of annotated data. Semi-supervised and unsupervised learning have their limitations, too, but both promise to supercharge Alexa's capabilities by imbuing a human-like capacity for inference.

"[What we're] chasing is self-learning -- that's where our focus is," Ruhi Sarikaya, Amazon's director of applied science and Alexa Machine Learning, told VentureBeat in an interview. "[We're] expanding it to several domains."

Teaching oneself

The process of solving problems with machine learning typically begins with annotated data for which the target answer is already known. The data -- part of a larger data set, or corpus -- exhibits features that are identified through feature engineering and are labeled by hand. Models in this paradigm learn mathematical relationships such that they're able to anticipate answers to unfamiliar questions, at which point their predictions can be checked against ground truth for accuracy.

Data labeling has given rise to a cottage industry dominated by startups like Hive and Alegion, as well as Scale AI, which recently raised $100 million and nabbed customers including OpenAI, the Toyota Research Institute, Uber, NuTonomy, and Google parent company Alphabet's Waymo. Amazon contracts third-party firms to annotate thousands of hours of audio each day from Alexa devices for quality assurance and R&D.

But labeling remains a time-consuming chore for Amazon's Alexa research teams, which often deal with data sets comprising millions of verbal requests and replies. Plus, it's impractical in domains where samples are hard to come by -- a fact underlined by Cleo and Alexa Answers, two Amazon services that crowdsource answers to questions designed to expand Alexa's base of knowledge.

"The complexity of systems like Alexa -- in general, conversational systems -- is increasing because the complexity of the world around us is increasing," said Sarikaya. "We have more devices, better internet connectivity, and better sensors that are collecting signals from the environment and serving as a kind of digital nervous system [...] This complexity is creating friction for the customer, [and the] current methods that we use involving ground truth data and labeling are not going to be useful."

Although a question like "What is the temperature?" might seem straightforward on its face, it's hopelessly fraught in Alexa's eyes. That's because "temperature" might refer to a connected thermostat's setting or the temperature of a smart oven, a room, or the air outdoors.

Normally, training an AI system to correctly interpret "temperature," given variables like the time of day, the devices in a room, and the questioner's habits, would require isolating important features and carefully annotating each. But in an unsupervised approach, a model can learn to draw conclusions from contextual clues.

Consider that sentences in an Alexa command can be embedded in a high-dimensional space, where they can be grouped together according to how frequently the words within them co-occur with other words. Unsupervised algorithms are able to extrapolate from the labeled sentences to effectively label the unlabeled sentences in the same clusters, expanding the number of training examples available to other models.
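The cluster-and-propagate idea can be illustrated with a minimal sketch. It is not Amazon's method: the bag-of-words `embed` function is a crude stand-in for a co-occurrence-based embedding, the greedy clustering and its `min_sim` cutoff are assumptions chosen for brevity, and the intent labels and function names are hypothetical.

```python
import math
from collections import Counter


def embed(sentence):
    """Toy embedding: word counts (a stand-in for a learned sentence vector)."""
    return Counter(sentence.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cluster(sentences, min_sim=0.5):
    """Greedy one-pass clustering: join the most similar cluster, else start one."""
    clusters = []  # each cluster: {"centroid": Counter, "members": [indices]}
    for i, sentence in enumerate(sentences):
        vec = embed(sentence)
        best = max(clusters, key=lambda c: cosine(vec, c["centroid"]), default=None)
        if best is not None and cosine(vec, best["centroid"]) >= min_sim:
            best["members"].append(i)
            best["centroid"].update(vec)  # fold the sentence into the centroid
        else:
            clusters.append({"centroid": Counter(vec), "members": [i]})
    return clusters


def propagate_labels(sentences, known, min_sim=0.5):
    """Give unlabeled sentences the majority label of their cluster, if it has one."""
    labels = dict(known)
    for c in cluster(sentences, min_sim):
        votes = Counter(known[i] for i in c["members"] if i in known)
        if votes:
            majority = votes.most_common(1)[0][0]
            for i in c["members"]:
                labels.setdefault(i, majority)
    return labels


sentences = [
    "play the alphabet song",
    "play the abc song",
    "turn on the kitchen lights",
    "turn on the lights",
    "play alphabet song now",
]
known = {0: "PlayMusic", 2: "SmartHome"}  # only two sentences are hand-labeled
print(propagate_labels(sentences, known))
```

Here two hand-labeled sentences are enough to label all five, because each unlabeled sentence lands in a cluster that already contains a labeled one.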

Short of fully unsupervised learning, there's semi-supervised learning, and one of the most common flavors is self-training. That's where an AI system trained on a small amount of labeled data applies labels to a larger set of unlabeled data. Machine learning models' outputs have associated confidence scores, and in semi-supervised self-training, the outputs of the system are sorted according to the confidence score. Those that fall within a predefined range are used to train the system further.
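A rough sketch of that self-training loop follows. The nearest-centroid classifier, the use of cosine similarity as the confidence score, and the `conf_range` values are all illustrative assumptions rather than details of Alexa's systems.

```python
import math
from collections import Counter, defaultdict


def featurize(text):
    """Bag-of-words features for a single utterance."""
    return Counter(text.lower().split())


def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def train(examples):
    """Build one word-count centroid per label from (text, label) pairs."""
    centroids = defaultdict(Counter)
    for text, label in examples:
        centroids[label].update(featurize(text))
    return dict(centroids)


def predict(model, text):
    """Return (label, confidence); confidence is similarity to the best centroid."""
    vec = featurize(text)
    scores = {label: cosine(vec, centroid) for label, centroid in model.items()}
    label = max(scores, key=scores.get)
    return label, scores[label]


def self_train(labeled, unlabeled, conf_range=(0.6, 1.0), rounds=3):
    """Repeatedly pseudo-label the pool, keeping predictions inside conf_range."""
    low, high = conf_range
    labeled, pool = list(labeled), list(unlabeled)
    for _ in range(rounds):
        model = train(labeled)
        accepted, remaining = [], []
        for text in pool:
            label, conf = predict(model, text)
            if low <= conf <= high:
                accepted.append((text, label))
            else:
                remaining.append(text)
        if not accepted:  # nothing confident enough; stop early
            break
        labeled.extend(accepted)
        pool = remaining
    return train(labeled), labeled, pool


seed = [("turn on the lights", "lights"), ("play some music", "music")]
pool = ["turn off the lights", "play music please", "what is the temperature"]
model, grown, leftover = self_train(seed, pool)
print(grown)     # seed examples plus the confidently pseudo-labeled ones
print(leftover)  # utterances the model was not confident about
```

The low-confidence utterance stays in the unlabeled pool instead of being given a label, which is the point of the confidence filter: mistakes that get pseudo-labeled would be reinforced in later rounds.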

Signs of improvement

Such techniques have already made their way into production, albeit in a limited fashion. If Alexa customers in the U.S., Canada, Australia, the U.K., and India ask the assistant something like "Alexa, turn on the Sofa Lights," but the lights they're trying to turn on are actually named "Living Room Lights," Alexa might helpfully suggest "Did you mean Living Room Lights?"

"Context is super important. Your conversation doesn't stop when you move from the kitchen to the living room and sit down to watch a movie," Amazon devices and services SVP Dave Limp told a gaggle of reporters at Amazon's re:MARS conference in Las Vegas earlier this year. "We've started rolling out this sense of context -- [features] that figure out where in the house you are. If you walk into a room with an Echo [smart speaker] or smart home device, you don't have to say 'Turn the lights on in the kitchen' or 'Turn the lights on in the living room,' because they've been associated with each other over time. Now, you just say 'Turn on the lights' and the appropriate lights go on, or you can go into the living room and say 'Watch XYZ on Netflix' and it'll automatically turn the TV on because [Alexa] knows I'm in that room and that endpoint knows to do that."

Efforts to improve Alexa's anticipatory prowess dovetail with the evolution of Alexa Hunches, which proactively recommends actions based on data from connected devices and sensors. For example, if you say "Alexa, goodnight," the assistant might reply, "By the way, your living room light is on. Do you want me to turn it off?"

Amazon VP of smart home Daniel Rausch told VentureBeat that Hunches, which began with smart lights but is expanding to other devices, is a natural fit for self-guided learning. "If you look at the data for my house, for example, you'll see there's a very predictable pattern," he said. "At night when we go to bed, most of my devices are in the state I prefer to have them in, but you'll also see some anomalies -- perhaps I left the basement light on or forgot to lock the door. We taught Alexa to build those inferences herself and then offer them to me."

Beyond the smart home domain, unsupervised and semi-supervised methods are informing Alexa's selection of Skills, the more than 90,000 voice apps from 325,000 developers available in the Alexa Skills Store. When a customer supplies a request that requires a third-party service or integration, Alexa automatically chooses from thousands of skills, using a recommender system akin to the product suggestion engine on Amazon.com. Late last year, scientists at the company rolled out a model that considers intended skills -- linked skills invoked when a user requests something -- to bolster skill suggestion accuracy by 12%.

"The key is using the technique that’s right for the type of problem, whether it’s examining a behavioral pattern or trying to establish semantic similarity with ground truths, and then tuning a meta-model that takes those individual signals into account, producing a user experience that's helpful instead of one that makes assumptions," said Smith. "The context is that we're trying to build toward a world where Alexa understands you in a much more natural way, rather than training people to talk in Alexa's terms. If we have a pretty good idea of what you're saying, we'll simply perform the intended task, but what we're evolving toward is a model where Alexa gets ground truths from customers."

In another example of unsupervised learning transforming the ways in which Alexa's models are learned, Amazon researchers described a technique that tapped 250 million unannotated customer interactions to reduce speech recognition errors by 8%. Two semi-supervised learning techniques yielded greater gains: Using an acoustic model trained on 7,000 hours of labeled data and 1 million hours of unannotated data, Amazon scientists managed to cut error rates by 10% to 22%. Meanwhile, a separate team reduced errors by 20% with 800 hours of annotated data and 7,200 hours of "softly" labeled data that contained artificially generated noise.

"Instead of relying on human annotators to tell us what the ground truth is [for] these exponentially growing permutations, the machine learning algorithms themselves tell us what the ground truth is [without] the underlying perception, and without the underlying intent and entities," said Sarikaya.

He laid out a real-world example: Initially, when Alexa customers would say "Alexa, play ABCs," referring to the English-language song set to Mozart's "Twelve Variations on Ah, Vous Dirai-Je, Maman," Alexa didn't know how to interpret it. "Alphabet songs" run the gamut from phonics songs (which teach the different sounds associated with each letter) to acrostic songs (which contain lyrics running sequentially through the alphabet), not to mention regional variations compensating for different pronunciations.

A portion of customers who stumped Alexa with the alphabet song request opted not to try it again, whereas others reformulated the question (e.g., "Alexa, play the alphabet song") in a way Alexa understood. Successful and unsuccessful exchanges logged over the course of several months fed into machine learning algorithms, such that Alexa eventually learned to play the English alphabet song when asked "Alexa, play ABCs."
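One way to picture learning from reformulations is a simple counting scheme over logged sessions. This is a hypothetical sketch, not the algorithm Amazon uses: the session format, the `learn_rewrites` function, and the `min_support` and `min_share` thresholds are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def learn_rewrites(sessions, min_support=3, min_share=0.6):
    """Map failed requests to the rephrasing customers most often succeed with.

    sessions: list of sessions, each a time-ordered list of
              (utterance, succeeded) pairs.
    """
    followups = defaultdict(Counter)
    for turns in sessions:
        # Look at consecutive turns: a failure immediately followed by a success.
        for (utt, ok), (next_utt, next_ok) in zip(turns, turns[1:]):
            if not ok and next_ok:
                followups[utt][next_utt] += 1

    rewrites = {}
    for failed, counts in followups.items():
        target, n = counts.most_common(1)[0]
        # Adopt a rewrite only if it is both frequent and dominant.
        if n >= min_support and n / sum(counts.values()) >= min_share:
            rewrites[failed] = target
    return rewrites


sessions = [
    [("play abcs", False), ("play the alphabet song", True)],
    [("play abcs", False), ("play the alphabet song", True)],
    [("play abcs", False), ("play the alphabet song", True)],
    [("play abcs", False)],                        # customer gave up
    [("play abcs", False), ("play music", True)],  # unrelated follow-up
]
print(learn_rewrites(sessions))
```

No annotator is involved here: the customers' own successful rephrasings supply the ground truth, and the thresholds keep one-off or ambiguous follow-ups from becoming rewrites.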

"Customers want to interact with Alexa as naturally as they do with human beings, but it's so difficult to carry over the context in multi-turn conversations in which speakers drop certain entities and add new ones. It's super hard for machines," said Sarikaya. "With self-learning, you're bringing the customer into the equation. [As the experience improves], it'll lead to more engagement, and more engagement will lead to more data, which will feed to machine learning systems."
