Meta claims its AI improves speech recognition quality by reading lips

People perceive speech both by listening to it and by watching speakers' lip movements. In fact, studies show that visual cues play a key role in language learning. By contrast, AI speech recognition systems are built mostly -- or entirely -- on audio. And they require a substantial amount of training data, typically tens of thousands of hours of recordings.

To investigate whether visuals -- specifically footage of mouth movement -- can improve the performance of speech recognition systems, researchers at Meta (formerly Facebook) developed Audio-Visual Hidden Unit BERT (AV-HuBERT), a framework that learns to understand speech by both watching and hearing people speak. Meta claims that AV-HuBERT is 75% more accurate than the best audiovisual speech recognition systems given the same amount of transcribed data. Moreover, the company says, AV-HuBERT outperforms the former best audiovisual speech recognition system using one-tenth of the labeled data -- making it potentially useful for languages with little audio data.

"In the future, AI frameworks like AV-HuBERT could be used to improve the performance of speech recognition technology in noisy everyday conditions -- for example, interactions at a party or in a bustling street market," Meta AI research scientist Abdelrahman Mohamed told VentureBeat in an interview. "And assistants in smartphones, augmented reality glasses, and smart speakers equipped with a camera -- e.g., Alexa Echo Show -- could benefit from this technology, too."

AV-HuBERT

Meta isn't the first to apply AI to the problem of lip-reading. In 2016, researchers at the University of Oxford created a system that was nearly twice as accurate as experienced lip readers in certain tests and could process video in close to real time. And in 2017, Alphabet-owned DeepMind trained a system on thousands of hours of TV shows to correctly transcribe about 50% of the words in a test set, far better than a human expert's 12.4%.

But the University of Oxford and DeepMind models, as with many subsequent lip-reading models, were limited in the range of vocabulary that they could recognize. The models also required datasets paired with transcripts in order to train, and they couldn't process the audio of any speakers in the videos.

Somewhat uniquely, AV-HuBERT leverages unsupervised, or self-supervised, learning. With supervised learning, algorithms like DeepMind's are trained on labeled example data until they can detect the underlying relationships between the examples and particular outputs. For instance, a system might be trained to write the word "dog" (the output) when shown a picture of a Corgi (the example). By contrast, AV-HuBERT teaches itself to classify unlabeled data -- processing the data to learn from its inherent structure.

AV-HuBERT is also multimodal in the sense that it learns to perceive language through a series of audio and lip-movement cues. By combining cues like the movement of the lips and teeth during speaking, along with auditory information, Meta says that AV-HuBERT can capture "nuanced associations" between the two data types.

The initial AV-HuBERT model was trained on 30 hours of labeled English-language TED Talk videos, substantially less than the 31,000 hours on which the previous state-of-the-art model was trained. But despite training on less data, AV-HuBERT's word error rate (WER), a measure of speech recognition performance, was slightly better at 32.5% versus the old model's 33.6% in cases where a speaker could be seen but not heard. (WER is calculated by dividing the number of word errors -- substitutions, deletions, and insertions -- by the total number of words in the reference transcript; 32.5% translates to roughly one error every three words.) Training on 433 hours of TED Talks further reduced AV-HuBERT's WER to 28.6%.
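
For readers who want to see how a WER figure is produced, the following is a minimal Python sketch and not Meta's evaluation code. It counts word-level substitutions, deletions, and insertions with a standard edit-distance calculation and divides by the length of the reference transcript. The function name and the sample sentences are made up for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Illustrative helper (hypothetical name); real evaluations usually
    normalize case and punctuation before scoring.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word was dropped
                curr[j - 1] + 1,     # insertion: extra word in the hypothesis
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# Made-up example: one substituted word and one dropped word out of six.
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
print(f"{wer:.1%}")  # prints 33.3%
```

In this toy case, two errors against a six-word reference give a WER of 33.3%, which is in the same range as the lip-reading-only figures quoted above.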

Once AV-HuBERT had learned the structure of the audio and visual data and the correlations between them, the researchers were able to further train it on unlabeled data: 2,442 hours of English-language videos of celebrities uploaded to YouTube. Not only did this bring the WER down to 26.9%, but Meta says that it demonstrates that only a small amount of labeled data is needed to train the framework for a particular application (e.g., when multiple people are speaking simultaneously) or a different language.

Indeed, Meta claims that AV-HuBERT is about 50% better than audio-only models at recognizing a person's speech while loud music or noise is playing in the background. And when the speech and background noise are equally loud, AV-HuBERT manages a 3.2% WER versus the previous best multimodal model's 25.5%.
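
One common way to compare two WER figures is relative error reduction: the drop in WER divided by the baseline WER. The short Python sketch below applies that convention to numbers quoted in this article. Meta has not said that its "75%" and "50%" claims were computed this way, so treat the convention, and the hypothetical function name, as assumptions.

```python
def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Fraction of the baseline's errors that the new model eliminates.

    Hypothetical helper for illustration; both arguments are WERs in percent.
    """
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return (baseline_wer - new_wer) / baseline_wer


# (baseline WER, AV-HuBERT WER) pairs taken from the figures in this article.
comparisons = {
    "speaker seen but not heard, 30 labeled hours": (33.6, 32.5),
    "speech and noise equally loud": (25.5, 3.2),
}

for label, (baseline, new) in comparisons.items():
    print(f"{label}: {relative_wer_reduction(baseline, new):.1%}")
```

Under this convention, the lip-reading-only result is a relative reduction of about 3%, while the equally-loud-noise result is a reduction of about 87%.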

Potential shortcomings

In many ways, AV-HuBERT is emblematic of Meta's growing investment in unsupervised, multimodal technology for complex tasks. The company recently detailed a new multimodal system designed to tackle harmful content on its platforms, called Few-Shot Learner, and released models that can learn to recognize speech, segment images, copy the style of text, and recognize objects from unlabeled data. As opposed to supervised systems, unsupervised systems can be significantly more flexible and cheaper to deploy; the labels in labeled datasets come from human annotators who have to painstakingly add each one.

Because it requires less labeled data for training, Meta says that AV-HuBERT could open up possibilities for developing conversational models for "low-resource" languages, like Susu in the Niger-Congo family. AV-HuBERT could also be useful in creating speech recognition systems for people with speech impairments, the company suggests, as well as detecting deepfakes and generating realistic lip movements for virtual reality avatars.

But Os Keyes, an AI ethicist at the University of Washington, expressed concerns that AV-HuBERT has limitations around class and disability baked in. "If you're trying to assess people's speech patterns from 'the movement of lips and teeth,' how does that work for people with distorted facial speech patterns as a result of disability?" they told VentureBeat via email. "It seems kind of ironic to manage to build software for speech recognition that depends on lip reading, and is likely to have inaccuracies when pointed at ... deaf people."

In a Microsoft and Carnegie Mellon paper proposing a research roadmap toward fairness in AI, the coauthors point out that aspects of facial analysis systems akin to AV-HuBERT may not work well for people with Down syndrome, achondroplasia (which impairs bone growth), and "other conditions that result in characteristic facial differences." Such systems might also fail for people who've had a stroke, the researchers note, or who have Parkinson’s disease, Bell’s palsy, autism, or Williams syndrome -- who may not use (or be able to use) the same facial expressions as neurotypical people.

In an email, Mohamed emphasized that AV-HuBERT focuses only on the lip region to capture lip movements -- not the whole face. As with most AI models, the performance of AV-HuBERT will be "proportional to the number of representative samples of different populations in the training data," he added.

"For evaluating our approach, we used the publicly available LRS3 dataset, which consists of TED Talk videos that were made publicly available in 2018 by the University of Oxford researchers. Since this dataset doesn't represent speakers with disabilities, we do not have a specific percentage for the expected performance degradation," Mohamed said. "[But this] newly proposed technology is not limited by the current speaker distribution in the training dataset. We anticipate that different training datasets with coverage of broader and diverse populations would bring considerable performance gains."

Meta says that it will "continue to benchmark and develop approaches that improve audio-visual speech recognition models in everyday scenarios where background noise and speaker overlap are commonplace." It also intends to extend AV-HuBERT -- which Meta doesn't plan to put into production -- to multilingual benchmarks beyond English.