Smart assistants and voice-enabled speakers are more popular now than ever before. About 47.3 million U.S. adults have access to a smart speaker, according to Voicebot.ai, and just over half of smartphone owners — 52 percent — report that they use voice assistants on their mobile devices. But popularity doesn’t necessarily translate to accuracy. As anyone who’s tried to get Cortana or Alexa’s attention at a party can tell you, they’re not exactly aces when it comes to isolating speech from a crowd.
Boston, Massachusetts-based Yobe claims it can make the assistants better listeners. The startup, which was founded out of the Massachusetts Institute of Technology (MIT) and raised nearly $2 million in seed funding from Clique Capital Partners and a National Science Foundation SBIR grant, today launched Voice Identification System for user Profile Retrieval (VISPR), an “intelligence” that can identify, track, and separate voices in noisy environments. It claims that artificial intelligence (AI) allows its software stack to accurately track a voice in “any auditory environment.”
Yobe says that with VISPR, mic-sporting devices like smartwatches, hearing aids, and smart home appliances can identify voices with no more than a wake word and can perform far-field voice personalization. It also claims VISPR can reduce speech recognition errors by up to 85 percent.
“[Our] technology is fixing the most persistent challenge of voice technology in the market today,” said Yobe CEO and cofounder Ken Sutton. “Smart phones, speakers, and other connected devices have been limited in providing an exceptional voice user interface.”
Sutton, who founded Yobe with MIT PhD and AI-assisted signal processing researcher Dr. S. Hamid Nawab, said the company will focus its efforts on licensing.
VISPR takes a multipronged approach to the cocktail party problem. Its AI models actively reason through interactions of voices and ambient noise, while its signal processing pipeline adapts on the fly to changes in “scene characteristics” (the acoustics of a room, the number of speakers, and overall noise level). That same pipeline taps sophisticated temporal, spectral, and statistical techniques to parse incoming audio signals and generalize across different microphone-array sizes and configurations. (Not all voice-enabled devices are created equal: Amazon’s Echo Dot has 7 microphones compared to the Google Home Mini’s 2, for example.)
In plain English, VISPR records sound and amplifies it, uses AI to denoise it and isolate individual voices, and listens for telltale biometric identifiers unique to each person. It’s akin to Google’s Voice Match and Amazon’s Alexa Voice Profiles in that it can retrieve the user profiles and permissions associated with a speaker, but Yobe claims its solution is much more robust.
The product launch comes weeks after scientists at Google and the Idiap Research Institute in Switzerland detailed an AI voice recognition system that “significantly” reduced word error rate (WER) on multispeaker signals. In the same vein of research, MIT’s Computer Science and Artificial Intelligence Lab earlier this year demoed tech — PixelPlayer — that learned to isolate the sounds of individual instruments from YouTube videos. And in 2015, researchers at the University of Surrey designed an AI model that output vocal spectrograms when fed songs.
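For readers unfamiliar with the metric, word error rate is conventionally computed as the number of word substitutions, deletions, and insertions needed to turn a recognizer's transcript into the reference transcript, divided by the number of words in the reference. The Python sketch below illustrates that standard definition only. It is not the scoring code used by Yobe, Google, or Idiap, and the function name `word_error_rate` and the sample sentences are made up for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length.

    Illustrative helper (hypothetical name); splits on whitespace and
    does no text normalization such as lowercasing or punctuation removal.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing
                curr[j - 1] + 1,     # insertion: extra hypothesis word
                prev[j - 1] + cost,  # substitution or exact match
            )
        prev = curr
    return prev[-1] / len(ref)


# Made-up example: one dropped word and one misheard word out of five.
sample_wer = word_error_rate(
    "turn on the kitchen lights",
    "turn on kitchen light",
)
print(f"WER: {sample_wer:.0%}")
```

Because insertions count as errors, the value can exceed 100 percent when a transcript contains many spurious words, which is one reason crowded, multispeaker audio tends to score so poorly.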