The system, which runs locally on a smartphone or other portable device, comprises two kinds of neural networks: a recurrent neural network (RNN), which uses its internal state, or memory, to process sequences of inputs, and a convolutional neural network (CNN), a type of neural network loosely modeled on the connectivity pattern between neurons in the brain. On average, it recognizes words and phrases with 95 percent accuracy, Lott said.
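To make the idea of a recurrent network's "internal state" concrete, here is a minimal Python sketch. It is not Qualcomm's model: the single scalar state, the weights, and the names `rnn_step` and `run_rnn` are all hypothetical, chosen only to show how an earlier input keeps influencing later outputs.

```python
import math

def rnn_step(x, h, w_in, w_rec, bias):
    """One recurrent update: mix the current input x with the previous state h."""
    return math.tanh(w_in * x + w_rec * h + bias)

def run_rnn(inputs, w_in=0.5, w_rec=0.9, bias=0.0):
    """Feed a sequence through the cell and return the state after each step.

    The weights are illustrative constants; a real network learns them from data.
    """
    h = 0.0  # the internal state ("memory") starts empty
    states = []
    for x in inputs:
        h = rnn_step(x, h, w_in, w_rec, bias)
        states.append(h)
    return states

# A single non-zero input followed by silence: the state stays non-zero
# on later steps because the cell carries its memory forward.
states = run_rnn([1.0, 0.0, 0.0])
```

In this toy run, the second and third states are non-zero even though their inputs are zero, which is the "memory" the article refers to. A production speech model would use vector-valued states and trained weights rather than these fixed scalars.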
“It learns from patterns [and] from your use of the device,” he said. “It can personalize its behavior to you.”
Most voice recognition systems today do the bulk of their processing in the cloud, Lott explained. The microphones and chips in phones, smart speakers like Google Home and Amazon’s Echo, and Windows computers with Microsoft’s Cortana assistant enabled listen for “hot words” like “OK Google” and “Hey Cortana,” which prime the system for the string of voice commands to come. But they don’t analyze those commands themselves; they hand the grunt work off to powerful remote servers running complex machine learning algorithms.
For some users, surrendering their voice data to the cloud raises privacy concerns. Both Amazon’s Alexa assistant and Google Assistant record snippets before sending them off for analysis, and they retain those voice snippets until users choose to delete them. Both companies say they use audio recordings to improve their services and provide more personalized responses.
But in some cases, the recordings don’t remain private. In 2016, detectives in Arkansas investigating a murder sought access to voice data from an Amazon Echo speaker; Amazon ultimately shared the data with the defendant’s permission.
On-device voice processing has benefits in addition to privacy, Lott said. Because it doesn’t have to offload data to the cloud, it responds instantly to commands, and because it doesn’t require an internet connection, it’s much more reliable.
“There’s a push to do the whole end-to-end system in some neural net fashion,” he said. “It’s something that’s going to make interacting with devices more natural.”
Lott has a point. In 2016, Google created an offline speech recognition system that ran 7 times faster than real time on a smartphone. The model, which was trained on roughly 2,000 hours of voice data, was 20.3 megabytes in size and achieved 86.5 percent accuracy.
Of course, on-device voice recognition has its own set of limitations. Algorithms designed to work offline can’t connect to the internet to search for answers to questions, and they miss out on the improvements made possible in cloud-based systems with larger, more diverse datasets.
But Lott thinks Qualcomm’s solution is the way forward. “A lot of things are happening on the cloud, but we think it should be happening directly on the device.”