Facebook rolls out automatic captions for Instagram TV

Facebook today announced the availability of automatic captions for Instagram TV (IGTV), beginning with captions for on-demand videos in 16 languages globally. The rollout follows the launch of automatic captioning for Facebook Live and Workplace Live, which arrived in March for six languages (English, Spanish, Portuguese, Italian, German, and French).

Facebook says expanded captioning builds upon the alternative text updates it made a few years ago to support people with limited vision. "As more people use the captions, the AI will learn and we expect the quality to continue to improve. This is a small step, and we'll look to expand to more surfaces, languages, and countries moving forward," a spokesperson told VentureBeat via email.

In a blog post, Facebook explains that it trained the machine learning models behind its automatic speech recognition to directly predict the graphemes (or characters) of words, which it says simplifies model training and deployment. Using public Facebook posts to prime the system, engineers trained models to adapt to new words like "COVID" and predict where they'll occur in videos.

Facebook also says it was able to deploy these models with a number of infrastructure optimizations, enabling it to serve additional video traffic resulting from pandemic-related loads. According to Facebook, the number of Facebook Live broadcasts from Pages doubled in June 2020 compared to the same time last year.

Facebook launched its first automatic captioning product in February 2016, for video ads. In October of that same year, the social network rolled out a free video captioning tool for all U.S. English Facebook Pages. While the tools have no doubt improved over the years, anecdotal evidence suggests they have a long way to go, with videos like last year's Antares rocket launch showing nonsense words with auto-captioning enabled. As Forbes noted in a recent piece, captioning errors disproportionately affect the video-watching experience of those with hearing impairments.

Evidently cognizant of its systems' shortcomings, Facebook says it is investigating ways to improve captioning going forward. In a technical paper published last month, data scientists at the company described wav2vec 2.0, a speech recognition framework they claim attained state-of-the-art results using just 10 minutes of labeled data. In July, Facebook researchers detailed a model that learned to understand words in 51 languages after training on over 16,000 hours of voice recordings. And in a study last month, Facebook managed to reduce word error rate (a common speech recognition performance metric) by over 20% using a novel method.
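
For readers unfamiliar with the metric, word error rate is conventionally computed as the minimum number of word substitutions, deletions, and insertions needed to turn a system's transcript into the reference transcript, divided by the number of words in the reference. The Python sketch below illustrates that textbook definition only; the function name and the sample sentences are made up for illustration, and it is not drawn from Facebook's research or tooling.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length.

    Illustrative helper (hypothetical name), using the standard
    Levenshtein dynamic program over whitespace-separated words.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# One substituted word plus one dropped word against a four-word reference.
print(word_error_rate("the rocket lifted off", "the rocked lifted"))  # 0.5
```

Because the denominator is the reference length, a caption track with many inserted words can score above 1.0, so the figure is best read as an error ratio rather than a strict percentage of words missed.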
