Facebook Wav2vec-U learns to recognize speech from unlabeled data

Facebook today announced that it trained an AI model to build speech recognition systems that don't require transcribed data. The company, which trained systems for Swahili, Tatar, Kyrgyz, and other languages, claims that the model, wav2vec Unsupervised (Wav2vec-U), is an important step toward building machines that can solve a range of tasks by learning from their observations.

AI-powered speech transcription platforms are a dime a dozen in a market estimated to be worth over $1.6 billion. Deepgram and Otter.ai build voice recognition models for cloud-based real-time processing, while Verbit offers tech not unlike that of Oto, which combines intonation with acoustic data to bolster speech understanding. Amazon, Google, Facebook, and Microsoft offer their own speech transcription services.

But the dominant form of AI for speech recognition falls into a category known as supervised learning. Supervised learning is defined by its use of labeled datasets to train algorithms to classify data and predict outcomes, which, while effective, is time-consuming and expensive. Companies have to obtain tens of thousands of hours of audio and recruit human teams to manually transcribe the data. And this same process has to be repeated for each language.

Unsupervised speech recognition

Facebook's Wav2vec-U aims to sidestep the challenges of supervised learning by taking an unsupervised approach that builds on self-supervised learning. With unsupervised learning, Wav2vec-U is fed data for which no previously defined labels exist. The system must teach itself to classify the data, processing it to learn from its structure.

While relatively underexplored in the speech domain, a growing body of research demonstrates the potential of learning from unlabeled data. Microsoft is using unsupervised learning to extract knowledge about disruptions to its cloud services. More recently, Facebook itself announced SEER, an unsupervised model trained on a billion images that achieves state-of-the-art results on a range of computer vision benchmarks.

Wav2vec-U learns purely from recorded speech and unpaired text, eliminating the need for transcriptions. Using speech representations from Facebook's self-supervised wav2vec 2.0 model together with what's called a clustering method, Wav2vec-U segments recordings into units that loosely correspond to particular sounds.
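To make the segmentation idea concrete, here is a minimal sketch in plain Python. It is not Facebook's implementation: the 2-D "frames", the two centroids, and the function names `nearest_centroid` and `segment_frames` are all assumptions made for illustration. It assigns each frame to its nearest cluster center, then merges runs of consecutive frames that share a cluster ID into one segment and averages their features.

```python
from itertools import groupby

def nearest_centroid(frame, centroids):
    """Return the index of the centroid closest to `frame` (squared Euclidean distance)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(centroids)), key=lambda i: sq_dist(frame, centroids[i]))

def segment_frames(frames, centroids):
    """Merge consecutive frames with the same cluster ID and mean-pool each run."""
    ids = [nearest_centroid(f, centroids) for f in frames]
    segments = []
    start = 0
    for cluster_id, run in groupby(ids):
        length = len(list(run))
        chunk = frames[start:start + length]
        dim = len(chunk[0])
        pooled = [sum(f[d] for f in chunk) / length for d in range(dim)]
        # (cluster ID, first frame index, one past the last frame index, pooled features)
        segments.append((cluster_id, start, start + length, pooled))
        start += length
    return segments

# Hypothetical toy data: six 2-D feature frames and two cluster centers.
frames = [[0.1, 0.0], [0.0, 0.2], [0.9, 1.0], [1.1, 0.9], [1.0, 1.0], [0.0, 0.1]]
centroids = [[0.0, 0.0], [1.0, 1.0]]

segments = segment_frames(frames, centroids)
for cluster_id, start, end, pooled in segments:
    print(cluster_id, start, end, pooled)
```

In this toy run the six frames collapse into three segments, which is the sense in which a segment "loosely corresponds" to one sound: a stretch of audio whose frames all look alike to the clustering step.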

To learn to recognize words in a recording, Facebook trained a generative adversarial network (GAN) consisting of a generator and a discriminator. The generator takes each audio segment and predicts the phoneme (i.e., unit of sound) it corresponds to in the language. It's trained by trying to fool the discriminator, which assesses whether the predicted phoneme sequences seem realistic. The discriminator, in turn, learns to distinguish between the generator's output and real text that has been "phonemized," or converted into sequences of phonemes.

While the GAN's transcriptions are initially poor in quality, they improve with the feedback of the discriminator.
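The tug-of-war between the two networks can be sketched with the textbook GAN objective. This is an illustration under stated assumptions, not the loss Facebook actually used: the binary cross-entropy formulation, the function names, and the discriminator scores below are all hypothetical. Each score is the discriminator's estimated probability that a phoneme sequence came from real phonemized text.

```python
import math

def bce(prob, target):
    """Binary cross-entropy for one probability and a 0/1 target."""
    eps = 1e-12
    prob = min(max(prob, eps), 1.0 - eps)  # avoid log(0)
    return -(target * math.log(prob) + (1 - target) * math.log(1.0 - prob))

def discriminator_loss(p_real, p_fake):
    """Low when real text is scored near 1 and generator output near 0."""
    real_term = sum(bce(p, 1) for p in p_real) / len(p_real)
    fake_term = sum(bce(p, 0) for p in p_fake) / len(p_fake)
    return real_term + fake_term

def generator_loss(p_fake):
    """Low when the discriminator scores generator output as real."""
    return sum(bce(p, 1) for p in p_fake) / len(p_fake)

# Hypothetical discriminator scores at two points in training.
real = [0.9, 0.8]          # scores for real phonemized text
early_fake = [0.05, 0.10]  # generator output is easy to spot early on
later_fake = [0.45, 0.55]  # later output is harder to tell from real text

print("generator loss, early:", generator_loss(early_fake))
print("generator loss, later:", generator_loss(later_fake))
print("discriminator loss, early:", discriminator_loss(real, early_fake))
print("discriminator loss, later:", discriminator_loss(real, later_fake))
```

As the generator's output becomes harder to tell apart from real phoneme sequences, its loss falls and the discriminator's rises, which is the feedback loop the article describes.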

"It takes about half a day -- roughly 12 to 15 hours on a single GPU -- to train an average Wav2vec-U model. This excludes self-supervised pre-training of the model, but we previously made these models publicly available for others to use," Facebook AI research scientist manager Michael Auli told VentureBeat via email. "Half a day on a single GPU is not very much, and this makes the technology accessible to a wider audience to build speech technology for many more languages of the world."

To get a sense of how well Wav2vec-U works in practice, Facebook says it evaluated it first on a benchmark called TIMIT. Trained on as little as 9.6 hours of speech and 3,000 sentences of text data, Wav2vec-U reduced the error rate by 63% compared with the next-best unsupervised method.
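For readers unfamiliar with how such figures are computed, the sketch below assumes the error rate is a phoneme error rate (edit distance between the predicted and reference phoneme sequences, divided by the reference length) and that the 63% figure is a relative reduction. The phoneme sequences and the baseline and new error rates are made-up values chosen only to illustrate the arithmetic; they are not Facebook's results.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance: minimum substitutions, insertions, and deletions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]

def phoneme_error_rate(ref, hyp):
    """Edit distance normalized by the number of reference phonemes."""
    return edit_distance(ref, hyp) / len(ref)

def relative_reduction(baseline, new):
    """Fraction by which `new` improves on `baseline`."""
    return (baseline - new) / baseline

# Hypothetical example: one substitution and one deletion against 4 reference phonemes.
ref = ["k", "ae", "t", "s"]
hyp = ["k", "ah", "t"]
print(phoneme_error_rate(ref, hyp))

# Hypothetical error rates: dropping from 40% to 14.8% is a 63% relative reduction.
print(round(relative_reduction(0.40, 0.148), 2))
```

Note that a relative reduction says nothing about the absolute error rate on its own: a 63% cut from a high baseline can still leave a system less accurate than a modest cut from a low one.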

Wav2vec-U was also as accurate as the state-of-the-art supervised speech recognition method from only a few years ago, which was trained on hundreds of hours of speech data.

Future work

AI has a well-known bias problem, and unsupervised learning doesn't eliminate the potential for bias in a system's predictions. For example, unsupervised computer vision systems can pick up racial and gender stereotypes present in training datasets. Some experts, including Facebook chief scientist Yann LeCun, theorize that removing these biases might require specialized training of unsupervised models with additional, smaller datasets curated to "unteach" specific biases.

Facebook acknowledges that more research must be done to figure out the best way to address bias. "We have not yet investigated potential biases in the model. Our focus was on developing a method to remove the need for supervision," Auli said. "A benefit of the self-supervised approach is that it may help avoid biases introduced through data labeling, but this is an important area that we are very interested in."

In the meantime, Facebook is open-sourcing the code for Wav2vec-U to enable developers to build speech recognition systems using unlabeled speech audio recordings and unlabeled text. While Facebook didn't use user data for the study, Auli says that there's potential for the model to support future internal and external tools, like video transcription.

"AI technologies like speech recognition should not benefit only people who are fluent in one of the world's most widely spoken languages. Reducing our dependence on annotated data is an important part of expanding access to these tools," Facebook wrote in a blog post. "People learn many speech-related skills just by listening to others around them. This suggests that there is a better way to train speech recognition models, one that does not require large amounts of labeled data."