Researchers develop offline speech recognition that's 97% accurate

Typically, deep learning approaches to voice recognition -- systems that employ layers of neuron-mimicking mathematical functions to parse human speech -- lean on powerful remote servers for the bulk of processing. But researchers at the University of Waterloo and startup DarwinAI claim to have pioneered a strategy for designing speech recognition networks that not only achieve state-of-the-art accuracy, but are also efficient enough to run on low-end smartphones.

They describe their method in a paper published on the preprint server arXiv.org ("EdgeSpeechNets: Highly Efficient Deep Neural Networks for Speech Recognition on the Edge"). It builds on work by Amazon's Alexa Machine Learning team, which earlier this year developed navigation, temperature control, and music playback algorithms that can run locally; Qualcomm, which in May claimed to have created on-device voice recognition models that are 95 percent accurate; Dublin, Ireland startup Voysis, which in September announced an offline WaveNet voice model for mobile devices; and Intel.

"In this study, we explore a human-machine collaborative design strategy for building low-footprint [deep neural network] architectures for speech recognition through a marriage of human-driven principled network design prototyping and machine-driven design exploration," the researchers wrote. "The efficacy of this design strategy is demonstrated through the design of a family of highly-efficient [deep neural networks] (nicknamed EdgeSpeechNets) for limited vocabulary speech recognition."

The team first constructed a prototype that performed limited-vocabulary speech recognition, or keyword spotting -- the ability to rapidly recognize specific keywords from a stream of speech. They then settled on a design method that transformed audio signals into mathematical representations called mel-frequency cepstral coefficients, leveraging deep residual learning for "greater representation capabilities" than traditional techniques.
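For readers unfamiliar with mel-frequency cepstral coefficients, the following is a minimal sketch of how one short audio frame might be turned into such coefficients, using only the Python standard library. The frame length, 16kHz sample rate, Hamming window, 20 filters, and 13 coefficients are illustrative assumptions, not settings reported by the researchers, and the function names are hypothetical.

```python
import cmath
import math

def hz_to_mel(hz):
    """Convert a frequency in Hz to the mel scale (common 2595*log10 form)."""
    return 2595.0 * math.log10(1.0 + hz / 700.0)

def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def power_spectrum(frame):
    """Power spectrum (bins 0..N/2) of one Hamming-windowed frame, via a naive DFT."""
    n = len(frame)
    windowed = [s * (0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)))
                for i, s in enumerate(frame)]
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                  for i, x in enumerate(windowed))
        spectrum.append(abs(acc) ** 2 / n)
    return spectrum

def mel_filterbank(num_filters, n_fft, sample_rate):
    """Triangular filters whose centers are evenly spaced on the mel scale."""
    low, high = hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0)
    mel_points = [low + (high - low) * i / (num_filters + 1)
                  for i in range(num_filters + 2)]
    bins = [int(math.floor((n_fft + 1) * mel_to_hz(m) / sample_rate))
            for m in mel_points]
    filters = []
    for j in range(1, num_filters + 1):
        left, center, right = bins[j - 1], bins[j], bins[j + 1]
        row = [0.0] * (n_fft // 2 + 1)
        for k in range(left, center):      # rising edge of the triangle
            row[k] = (k - left) / (center - left)
        for k in range(center, right):     # falling edge of the triangle
            row[k] = (right - k) / (right - center)
        filters.append(row)
    return filters

def mfcc(frame, sample_rate=16000, num_filters=20, num_coeffs=13):
    """MFCCs for one frame: power spectrum -> mel energies -> log -> DCT-II."""
    spec = power_spectrum(frame)
    energies = [max(sum(w * p for w, p in zip(row, spec)), 1e-10)  # avoid log(0)
                for row in mel_filterbank(num_filters, len(frame), sample_rate)]
    log_energies = [math.log(e) for e in energies]
    return [sum(le * math.cos(math.pi * c * (m + 0.5) / num_filters)
                for m, le in enumerate(log_energies))
            for c in range(num_coeffs)]

# Illustrative input: 256 samples of a 440Hz tone sampled at 16kHz.
frame = [math.sin(2 * math.pi * 440 * i / 16000) for i in range(256)]
coefficients = mfcc(frame)
print(len(coefficients))
```

In a keyword-spotting system, coefficients like these would typically be computed for many overlapping frames and stacked into a two-dimensional input for the neural network.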

Next, they embarked on generative synthesis, a machine-driven design strategy that builds deep neural networks with an emphasis on performance. In this case, the researchers used a configuration that ensured the speech models' validation accuracy was at least 95 percent.
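The article does not detail how generative synthesis works internally, so the sketch below does not attempt to reproduce it. It only illustrates the kind of constraint described: among candidate networks, keep those whose validation accuracy is at least 95 percent and prefer the smallest. The `Candidate` class, the function name, and all numbers are made up for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Candidate:
    """A hypothetical candidate network and its measured properties."""
    name: str
    num_parameters: int
    validation_accuracy: float

def pick_smallest_meeting_target(candidates, min_accuracy=0.95):
    """Return the smallest candidate that meets the accuracy floor, or None."""
    feasible = [c for c in candidates if c.validation_accuracy >= min_accuracy]
    if not feasible:
        return None
    return min(feasible, key=lambda c: c.num_parameters)

# Made-up candidates; these are not results from the paper.
candidates = [
    Candidate("candidate-1", 240_000, 0.958),
    Candidate("candidate-2", 110_000, 0.953),
    Candidate("candidate-3", 60_000, 0.941),
]
best = pick_smallest_meeting_target(candidates)
print(best.name)  # candidate-2: the smallest network still above the 0.95 floor
```

Here the smallest candidate is rejected because it falls below the accuracy floor, which is the trade-off such a constraint is meant to enforce.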

To evaluate the performance of the produced EdgeSpeechNets, the team used the Google Speech Commands dataset, which contains 65,000 one-second samples of 30 short words, along with background noise samples.

One of the models -- EdgeSpeechNet-A -- achieved 1 percent higher accuracy compared to a state-of-the-art speech recognition model (res15) while requiring measurably less processing power. Moreover, it achieved test accuracy reaching 97 percent, outperforming previously published results.

Another model -- EdgeSpeechNet-D -- ran on a Motorola Moto E phone's 1.4GHz Cortex-A53 processor with a prediction latency of 34 milliseconds and a memory footprint of less than 1MB -- a tenfold decrease in latency and a 16.5 percent smaller memory footprint than the aforementioned state-of-the-art neural network.

Yet another model -- EdgeSpeechNet-C, the smallest of the bunch -- managed higher accuracy than state-of-the-art with 7.8 times fewer parameters (the weights a network learns during training) and 10.7 times fewer multiply-add operations.
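To make "parameters" and "multiply-add operations" concrete, here is a rough sketch of how both are usually counted for a single convolutional layer, and how a "times fewer" figure is derived. The layer shape and the totals passed to `reduction_factor` are arbitrary examples, not taken from the EdgeSpeechNet or res15 architectures, and both function names are hypothetical.

```python
def conv2d_cost(kernel_h, kernel_w, in_channels, out_channels,
                out_h, out_w, bias=False):
    """Return (parameters, multiply-adds) for one standard 2D convolution.

    Each output value needs kernel_h * kernel_w * in_channels multiply-adds,
    and there are out_channels * out_h * out_w output values.
    """
    weights = kernel_h * kernel_w * in_channels * out_channels
    params = weights + (out_channels if bias else 0)
    multiply_adds = weights * out_h * out_w
    return params, multiply_adds

def reduction_factor(baseline, reduced):
    """How many times smaller `reduced` is than `baseline`."""
    return baseline / reduced

# Arbitrary example layer: 3x3 kernel, 45 -> 45 channels, 10x10 output map.
params, multiply_adds = conv2d_cost(3, 3, 45, 45, 10, 10)
print(params, multiply_adds)  # 18225 1822500

# Arbitrary totals: a 78-unit baseline versus a 10-unit model is "7.8 times fewer".
print(reduction_factor(78, 10))
```

Summing these counts over every layer gives the network totals that comparisons like the one above are based on.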

"The ... EdgeSpeechNets had higher accuracies at much smaller sizes [and] lower computational costs than state-of-the-art deep neural networks," the researchers wrote. "These results demonstrate that the EdgeSpeechNets were able to achieve state-of-the-art performance while still being noticeably smaller and requiring significantly fewer computations, making them very well-suited for on-device edge voice interface applications."

In future work, they plan to adapt the human-machine collaborative deep neural network design strategy to domains such as visual perception and natural language processing.