As it has for the past several years, Amazon on Tuesday unveiled a slew of new devices including a wall-mounted Echo display, a smart thermostat, and kid-friendly, Alexa-powered video chat hardware. Among the most intriguing is Astro, a two-wheeled home robot with a camera that can extend like a periscope on command. But arguably as intriguing are two new software features — Custom Sound Event Detection and Ring Custom Event Alerts — that signal a paradigm shift in machine learning.

Custom Sound allows users to “teach” Alexa-powered devices to recognize certain sounds, like when a refrigerator door opens and closes. Once Alexa learns these sounds, it can trigger notifications during specified hours, like a reminder to close the door so that food doesn’t go bad overnight. In a similar vein, Custom Event Alerts let Ring security camera owners create unique, personalized alert-sending detectors for objects in and around their homes (e.g., cars parked in the driveway). Amazon claims that Custom Event Alerts, which leverage computer vision, can detect objects of arbitrary shapes and sizes.

Both are outgrowths of current trends in machine learning: pretraining, fine-tuning, and semi-supervised learning. Unlike Alexa Guard and Ring’s preloaded object detectors, Custom Sound and Custom Event Alerts don’t require hours of data to learn to spot unfamiliar sounds and objects. Most likely, they fine-tune large models “pretrained” on a huge variety of data — e.g., sounds or objects — to the specific sounds or objects that a user wants to detect. Fine-tuning is a technique that’s been hugely successful in the natural language domain, where it’s been used to develop models that can detect sentiment in social media posts, identify hate speech and disinformation, and more.

“With Custom Sound Event Detection, the customer provides six to ten examples of a new sound — say, the doorbell ringing — when prompted by Alexa. Alexa uses these samples to build a detector for the new sound,” Amazon’s Prem Natarajan and Manoj Sindhwani explain in a blog post. “Similarly, with Ring Custom Event Alerts, the customer uses a cursor or, on a touch screen, a finger to outline a region of interest — say, the door of a shed — within the field of view of a particular camera. Then, by sorting through historical image captures from that camera, the customer identifies five examples of a particular state of that region — say, the shed door open — and five examples of an alternative state — say, the shed door closed.”
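To make the five-examples-per-state idea concrete, here is a minimal Python sketch of one common few-shot approach: average the embeddings of a handful of examples per state into a "prototype," then assign a new sample to the nearest prototype. This is an illustration only, not Amazon's implementation. The `embed` function is a hypothetical stand-in for a frozen pretrained encoder, and the pixel values are made-up toy data.

```python
import math


def embed(pixels):
    """Hypothetical stand-in for a frozen, pretrained encoder.

    A real system would run a large pretrained network here. This toy
    version summarizes a list of pixel intensities as (mean, spread)
    so the example stays self-contained.
    """
    mean = sum(pixels) / len(pixels)
    spread = max(pixels) - min(pixels)
    return (mean, spread)


def centroid(vectors):
    """Average a list of equal-length vectors component-wise."""
    n = len(vectors)
    return tuple(sum(v[i] for v in vectors) / n for i in range(len(vectors[0])))


def fit_prototypes(examples_by_label):
    """Build one prototype (mean embedding) per label from a few examples."""
    return {
        label: centroid([embed(x) for x in samples])
        for label, samples in examples_by_label.items()
    }


def predict(prototypes, pixels):
    """Return the label whose prototype is closest to the new sample."""
    z = embed(pixels)
    return min(prototypes, key=lambda label: math.dist(z, prototypes[label]))


# Toy data (assumed): five crops of the region of interest per state.
# An open shed door shows a dark interior; a closed door is bright.
examples = {
    "open": [[10, 12, 9, 11], [8, 10, 12, 9], [11, 13, 10, 12],
             [9, 9, 10, 11], [12, 10, 11, 13]],
    "closed": [[200, 198, 205, 201], [199, 202, 204, 200], [203, 201, 198, 202],
               [197, 200, 202, 199], [201, 203, 200, 204]],
}

prototypes = fit_prototypes(examples)
print(predict(prototypes, [15, 14, 16, 13]))
```

Because the encoder stays fixed and only the prototypes are computed from the customer's examples, ten samples can be enough in a sketch like this. How well it works in practice depends on the quality of the pretrained encoder.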

Computer vision startups like Landing AI and Cogniac similarly leverage fine-tuning to create classifiers for particular anomalies. It’s a form of semi-supervised learning, where a model is subjected to “unknown” data for which few previously defined categories or labels exist. That’s as opposed to supervised learning, where a model learns from datasets of annotated examples — for example, a picture of a doorway labeled “doorway.” In semi-supervised learning, a machine learning system must teach itself to classify the data, processing the partially labeled data to learn from its structure.
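One simple way a system can "teach itself" from partially labeled data is self-training, also called pseudo-labeling: fit a model on the few labeled points, label the unlabeled points it is confident about, and repeat. The Python sketch below is a toy illustration under assumed data, not a description of any vendor's system. The function names, the one-dimensional "loudness" values, and the `margin` threshold are all hypothetical.

```python
from collections import defaultdict


def class_means(labeled):
    """Compute the mean value of each label from (value, label) pairs."""
    groups = defaultdict(list)
    for value, label in labeled:
        groups[label].append(value)
    return {label: sum(vs) / len(vs) for label, vs in groups.items()}


def self_train(labeled, unlabeled, margin=2.0, max_rounds=10):
    """Self-training with a nearest-mean classifier on 1-D data.

    Requires at least two distinct labels. A point is pseudo-labeled only
    when its nearest class mean is closer than the runner-up by `margin`.
    Returns the final class means and the points left unlabeled.
    """
    labeled = list(labeled)
    pool = list(unlabeled)
    for _ in range(max_rounds):
        means = class_means(labeled)
        confident, uncertain = [], []
        for value in pool:
            ranked = sorted((abs(value - m), label) for label, m in means.items())
            (best_dist, best_label), (second_dist, _) = ranked[0], ranked[1]
            if second_dist - best_dist >= margin:
                confident.append((value, best_label))
            else:
                uncertain.append(value)
        if not confident:
            break  # nothing new to learn from; stop early
        labeled.extend(confident)
        pool = uncertain
    return class_means(labeled), pool


# Toy data (assumed): two labeled samples and seven unlabeled ones.
labeled = [(1.0, "quiet"), (9.0, "loud")]
unlabeled = [0.5, 1.5, 2.0, 8.0, 8.5, 9.5, 5.0]

means, leftover = self_train(labeled, unlabeled)
print(means, leftover)
```

The ambiguous midpoint (5.0) is never pseudo-labeled, which shows why a confidence threshold matters: without it, early mistakes would be folded back into the training set and reinforced.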

Two years ago, Amazon began experimenting with unsupervised and semi-supervised techniques to predict household routines like when to switch off the living room lights. It later expanded the use of these techniques to the language domain, where it taps them to improve Alexa’s natural language understanding.

“To train the encoder for Custom Sound Event Detection, the Alexa team took advantage of self-supervised learning … [W]e fine-tuned the model on labeled data — sound recordings labeled by type,” Natarajan and Sindhwani continued. “This enabled the encoder to learn finer distinctions between different types of sounds. Ring Custom Event Alerts uses this approach too, in which we leverage publicly available data.”

Potential and limitations

Unsupervised and semi-supervised learning in particular are enabling new applications in a range of domains, like extracting knowledge about disruptions to cloud services. For example, Microsoft researchers recently detailed SoftNER, an unsupervised learning framework the company deployed internally to collate information regarding storage, compute, and outages. They say it eliminated the need to annotate a large amount of training data and scaled to a high volume of timeouts, slow connections, and other interruptions.

Other showcases of unsupervised and semi-supervised learning’s potential abound, like Soniox, which employs unsupervised learning to build speech recognition systems. Microsoft’s Project Alexandria uses unsupervised and semi-supervised learning to parse documents in company knowledge bases. And DataVisor deploys unsupervised learning models to detect potentially fraudulent financial transactions.

But unsupervised and semi-supervised learning don’t eliminate the possibility of errors in a model’s predictions, like harmful biases. For example, unsupervised computer vision systems can pick up racial and gender stereotypes present in training datasets. Pretrained models, too, can be rife with major biases. Researchers at Carnegie Mellon University and George Washington University recently showed that computer vision algorithms pretrained on ImageNet exhibit prejudices about people’s race, gender, and weight.

Some experts, including Facebook’s Yann LeCun, theorize that removing these biases might be possible by training unsupervised models with additional, smaller datasets curated to “unteach” the biases. Beyond this, several “debiasing” methods have been proposed for natural language models fine-tuned from larger models. But it’s not a solved challenge by any stretch.

That said, products like Custom Sound and Custom Event Alerts illustrate the capabilities of more sophisticated, autonomous machine learning systems — assuming they work as advertised. In developing the earliest iterations of Alexa Guard, Amazon had to train machine learning models on hundreds of sound samples of glass breaking — a step that’s ostensibly no longer necessary.

Turing Award winners Yoshua Bengio and Yann LeCun believe that unsupervised and semi-supervised learning (among other techniques) are the key to human-level intelligence, and Custom Sound and Custom Event Alerts lend credence to that notion. The trick will be ensuring that they don’t fall victim to flaws that negatively influence their decision-making.

For AI coverage, send news tips to Kyle Wiggers — and be sure to subscribe to the AI Weekly newsletter and bookmark our AI channel, The Machine.

Thanks for reading,

Kyle Wiggers

AI Staff Writer

