
AI and machine learning algorithms capable of reading lips from videos are nothing out of the ordinary. Back in 2016, researchers from Google and the University of Oxford detailed a system that could annotate video footage with 46.8% accuracy, outperforming a professional human lip-reader's 12.4% accuracy. But even state-of-the-art systems struggle to overcome ambiguities in lip movements, preventing their performance from surpassing that of audio-based speech recognition.

In pursuit of a more performant system, researchers at Alibaba, Zhejiang University, and the Stevens Institute of Technology devised a method dubbed Lip by Speech (LIBS), which uses features extracted from speech recognizers as complementary clues. They say it achieves industry-leading accuracy on two benchmarks, beating the baseline by margins of 7.66% and 2.75% in character error rate.

LIBS and other solutions like it could help people who are hard of hearing follow videos that lack subtitles. According to the World Health Organization, an estimated 466 million people worldwide, or about 5% of the global population, live with disabling hearing loss, and that number could rise to over 900 million by 2050.


LIBS distills useful audio information from videos of human speakers at multiple scales: the sequence level, the context level, and the frame level. It then aligns this audio data with the video data by identifying the correspondence between the two (the sequences have inconsistent lengths because of differing sampling rates and blanks that sometimes appear at the beginning or end), and it applies a filtering technique to refine the distilled features.
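The general idea of this kind of multi-scale knowledge distillation can be sketched in a few lines of pure Python. This is an illustrative toy, not the paper's implementation: feature vectors are plain lists of floats, the function names are ours, and we assume the audio (teacher) and video (student) sequences have already been aligned to equal length.

```python
def l2_distance(a, b):
    """Squared L2 distance between two equal-length feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean_pool(frames):
    """Average a sequence of frame feature vectors into one vector."""
    dim = len(frames[0])
    return [sum(f[i] for f in frames) / len(frames) for i in range(dim)]

def distillation_loss(audio_frames, video_frames):
    """Toy multi-scale distillation: sequence-level plus frame-level terms.

    (A context-level term would work analogously on per-decoding-step
    attention contexts; it is omitted here for brevity.)
    """
    # Sequence level: match one pooled summary vector per utterance.
    seq_loss = l2_distance(mean_pool(audio_frames), mean_pool(video_frames))
    # Frame level: match each aligned audio/video frame pair individually.
    frame_loss = sum(
        l2_distance(a, v) for a, v in zip(audio_frames, video_frames)
    ) / len(video_frames)
    return seq_loss + frame_loss
```

Minimizing a loss of this shape pushes the video features toward the richer audio features at both coarse and fine granularity, which is the intuition behind using a speech recognizer as a teacher.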

Both the speech recognizer and lip reader components of LIBS are based on an attention-based sequence-to-sequence architecture, a machine translation technique that maps an input sequence (i.e., audio or video frames) to an output sequence of characters, using an attention mechanism to decide which input frames matter most at each decoding step. The researchers trained them on two corpora: LRS2, which contains more than 45,000 spoken sentences from the BBC, and CMLR, the largest available Chinese Mandarin lip-reading corpus, with over 100,000 natural sentences from the China Network Television website (covering over 3,000 Chinese characters and 20,000 phrases).
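A single attention step in such a decoder can be sketched as follows. Again, this is a generic dot-product-attention illustration in pure Python under our own naming, not code from LIBS: the decoder's current hidden state scores every encoder state, the scores are normalized with a softmax, and the result is a weighted "context" vector fed into the next prediction.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attend(decoder_state, encoder_states):
    """One attention step: score each encoder state against the current
    decoder state, normalize the scores, and return the weighted
    context vector along with the attention weights."""
    weights = softmax([dot(decoder_state, h) for h in encoder_states])
    dim = len(encoder_states[0])
    context = [
        sum(w * h[i] for w, h in zip(weights, encoder_states))
        for i in range(dim)
    ]
    return context, weights
```

The attention weights are also what the paper's frame-level distillation sharpens: more discriminative video frame features make these weights concentrate on the frames that actually carry the spoken content.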

The team notes that the model struggled to achieve "reasonable" results on the LRS2 data set, owing to the shortness of some sentences. (The decoder struggles to extract relevant information from sentences with fewer than 14 characters.) However, once pre-trained on sentences with a maximum length of 16 words, the decoder improved the quality of the end parts of sentences in the LRS2 data set by leveraging context-level knowledge. "[LIBS reduces] the focus on unrelated frames," the researchers wrote in a paper describing their work. "[T]he frame-level knowledge distillation further improves the discriminability of the video frame features, making the attention more focused."
