What do the world’s most popular virtual assistants — Google Assistant, Amazon’s Alexa, Microsoft’s Cortana, and Apple’s Siri — have in common? They perform much of their speech recognition in the cloud, where their natural language models can draw on powerful servers with nearly limitless processing power. That works fine most of the time — typically, processing happens in milliseconds — but it poses an obvious problem for users who find themselves without an internet connection.

Luckily, the Alexa Machine Learning team at Amazon recently made headway in bringing voice recognition models offline. They’ve developed models for navigation, temperature control, and music playback that can run locally, on-device.

The results of their research (“Statistical Model Compression for Small-Footprint Natural Language Understanding”) will be presented at this year’s Interspeech speech technology conference in Hyderabad, India.

It wasn’t easy. As the researchers explained, natural language processing models tend to have significant memory footprints. And the third-party apps that extend Alexa’s functionality — skills — are loaded on demand, only when needed, so pulling large models into memory at request time adds significant latency to voice recognition.


“Alexa’s natural-language-understanding systems … use several different types of machine-learning (ML) models, but they all share some common traits,” wrote Grant Strimel, a lead author, in a blog post describing the work. “One is that they learn to extract ‘features’ — or strings of text with particular predictive value — from input utterances … Another common trait is that each feature has a set of associated ‘weights,’ which determine how large a role it should play in different types of computation. The need to store multiple weights for millions of features is what makes ML models so memory intensive.”
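To see why that adds up, consider a toy feature-to-weights table. This is a hypothetical sketch: the feature strings, weight counts, and totals below are illustrative, not Alexa's actual data.

```python
# Hypothetical sketch: each feature string maps to several float weights, and a
# production model holds millions of such entries, so memory grows quickly.
feature_weights = {
    "play music by":   [0.82, -0.13, 0.45],
    "set temperature": [0.11,  0.67, -0.29],
    "navigate to":     [0.54,  0.08,  0.91],
}

# Back-of-the-envelope cost with 4-byte floats (illustrative numbers):
num_features, weights_per_feature = 5_000_000, 3
bytes_needed = num_features * weights_per_feature * 4
print(f"~{bytes_needed / 1e6:.0f} MB for weights alone")  # ~60 MB, before keys and overhead
```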

Eventually, they settled on a two-part solution: parameter quantization and perfect feature hashing.

Quantization — the process of converting a continuous range of values into a finite range of discrete values — is a conventional technique in algorithmic model compression. Here, the researchers divvied up the weights into 256 intervals, which allowed them to represent every weight in the model with a single byte of data. They rounded low weights to zero so that they could be discarded.
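A minimal sketch of that step in Python might look like the following: linear quantization of floating-point weights into 256 byte-sized levels, with near-zero weights zeroed out first. The threshold and scaling scheme are assumptions for illustration, not the exact parameters from the paper.

```python
import numpy as np

def quantize_weights(weights, num_levels=256, zero_threshold=1e-3):
    """Map float weights onto 256 discrete levels (one byte each)."""
    weights = np.asarray(weights, dtype=np.float32)
    # Round very small weights to zero so they can be discarded later.
    weights = np.where(np.abs(weights) < zero_threshold, 0.0, weights)

    # Divide the observed weight range into evenly spaced intervals.
    w_min, w_max = float(weights.min()), float(weights.max())
    scale = (w_max - w_min) / (num_levels - 1) or 1.0
    codes = np.round((weights - w_min) / scale).astype(np.uint8)  # 1 byte per weight
    return codes, w_min, scale

def dequantize(codes, w_min, scale):
    """Recover approximate float weights from the byte codes."""
    return codes.astype(np.float32) * scale + w_min

weights = np.random.randn(10).astype(np.float32)
codes, w_min, scale = quantize_weights(weights)
print(dequantize(codes, w_min, scale))  # close to the original weights
```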

The researchers’ second technique leveraged hash functions. A hash function, as Strimel wrote, “takes arbitrary inputs and scrambles them up … in such a way that the outputs (1) are of fixed size and (2) bear no predictable relationship to the inputs.” For example, if the output size were 16 bits, with 65,536 possible hash values, a value of 1 might map to “Weezer,” while a value of 50 might correspond to “Elton John.”
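As a rough illustration, hashing feature strings into a 16-bit space could look like the snippet below; the choice of MD5 and the example strings are assumptions, not details from the paper.

```python
import hashlib

def feature_hash(feature: str, bits: int = 16) -> int:
    """Hash an arbitrary feature string into a fixed-size 16-bit value."""
    digest = hashlib.md5(feature.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % (1 << bits)  # 0 .. 65,535

print(feature_hash("Weezer"))      # some value between 0 and 65,535
print(feature_hash("Elton John"))  # a different, unrelated value
```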

The problem with hash functions, though, is that they tend to produce collisions: distinct values (e.g., “Hank Williams, Jr.” and “Hank Williams”) that map to the same location in the table of hashes. The metadata required to distinguish between the colliding values’ weights often takes more space in memory than the data it tags.

To eliminate the need for that metadata, the team used a technique called perfect hashing, which maps a fixed number of data items to the same number of memory slots without any collisions.

“[T]he system can simply hash a string of characters and pull up the corresponding weights — no metadata required,” Strimel wrote.
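One simple way to build such a collision-free table for a fixed set of features is to search for a hash seed under which every key lands in its own slot. Production systems use more sophisticated constructions, so the brute-force sketch below, with hypothetical feature strings and weights, is only illustrative.

```python
import hashlib

def _hash(key: str, seed: int, table_size: int) -> int:
    data = f"{seed}:{key}".encode("utf-8")
    return int.from_bytes(hashlib.md5(data).digest()[:4], "big") % table_size

def find_perfect_seed(keys, table_size=None, max_tries=1_000_000):
    """Find a seed that maps every key to a distinct slot (no collisions)."""
    table_size = table_size or len(keys)
    for seed in range(max_tries):
        if len({_hash(k, seed, table_size) for k in keys}) == len(keys):
            return seed
    raise RuntimeError("no perfect seed found; try a larger table")

features = ["Weezer", "Elton John", "Hank Williams", "Hank Williams, Jr."]
weights = [0.7, 0.4, 0.9, 0.2]  # hypothetical per-feature weights

seed = find_perfect_seed(features)
table = [0.0] * len(features)
for f, w in zip(features, weights):
    table[_hash(f, seed, len(features))] = w

# Lookup hashes the string and reads the weight directly: no key metadata stored.
print(table[_hash("Hank Williams", seed, len(features))])  # 0.9
```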

In the end, the team said, quantization and perfect hashing resulted in a 14-fold reduction in memory usage compared with the uncompressed baseline models. And impressively, accuracy barely suffered — the compressed models performed “almost as well” as the baselines, with error increases of less than 1 percent.

“We observed the methods sacrifice minimally in terms of model evaluation time and predictive performance for the substantial compression gains observed,” they wrote. “We aim to reduce … memory footprint to enable local voice-assistants and decrease latency of [natural language processing] models in the cloud.”
