Facebook details AI that can understand videos

On the heels of a computer vision system that achieved state-of-the-art accuracy with minimal supervision, Facebook today announced a project called Learning from Videos that's designed to automatically learn audio, textual, and visual representations from publicly available Facebook videos. By learning from videos spanning nearly every country and hundreds of languages, Facebook says the project will not only help it to improve its core AI systems but enable entirely new experiences. Already, Learning from Videos, which began in 2020, has led to improved recommendations in Instagram Reels, according to Facebook.

Continuously learning from the world is one of the hallmarks of human intelligence. Just as people quickly learn to recognize places, things, and other people, AI systems could be smarter and more useful if they managed to mimic the way humans learn. As opposed to relying on the labeled datasets used to train many algorithms today, Facebook, Google, and others are looking toward self-supervised techniques that require few or no annotations.

For example, Facebook says it's using Generalized Data Transformations (GDT), a self-supervised system that learns the relationships between sounds and images, to suggest Instagram Reel clips relevant to recently watched videos while filtering out near-duplicates. Consisting of a series of models trained across dozens of GPUs on a dataset of millions of Reels and videos from Instagram, GDT can learn that a picture of an audience clapping probably goes with the sound of applause or that a video of a plane taking off likely goes with a loud roar. Moreover, because it uses audio as a signal alongside visuals, the system can surface recommendations based on videos that sound alike as well as videos that look alike.
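
As a rough illustration of how learned representations could drive this kind of recommendation, the sketch below ranks candidate clips by cosine similarity to a recently watched clip and skips near-duplicates. It is not Facebook's GDT code. The embedding vectors, the `recommend` function, and the 0.98 duplicate threshold are all hypothetical, and a real system would obtain its vectors from a trained audio-visual model rather than hand-written lists.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def recommend(watched, candidates, top_k=2, dup_threshold=0.98):
    """Rank candidates by similarity to `watched`, skipping near-duplicates.

    `candidates` maps a clip id to a hypothetical embedding vector.
    A candidate counts as a near-duplicate if it is at least
    `dup_threshold` similar to the watched clip or to a clip already picked.
    """
    ranked = sorted(candidates.items(),
                    key=lambda item: cosine(watched, item[1]),
                    reverse=True)
    picked = []
    for clip_id, embedding in ranked:
        if cosine(watched, embedding) >= dup_threshold:
            continue  # essentially the same clip the user just watched
        if any(cosine(embedding, candidates[p]) >= dup_threshold for p in picked):
            continue  # essentially the same as an earlier recommendation
        picked.append(clip_id)
        if len(picked) == top_k:
            break
    return picked

# Hypothetical embeddings (assumed for illustration only).
watched = [1.0, 0.0, 0.0]
candidates = {
    "reupload": [1.0, 0.01, 0.0],   # nearly identical to the watched clip
    "similar_a": [0.8, 0.6, 0.0],
    "similar_b": [0.6, 0.0, 0.8],
    "unrelated": [0.0, 0.0, 1.0],
}
print(recommend(watched, candidates))  # ['similar_a', 'similar_b']
```

In this toy setup, the re-uploaded clip scores highest but is filtered out as a near-duplicate, so the two merely similar clips are returned instead.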

When asked which Facebook and Instagram users had their content used to train systems like GDT and whether those users were informed the content was being used in this way, a Facebook spokesperson told VentureBeat that the company informs account holders in its data policy that Facebook "uses the information we have to support research and innovation." OneZero notes that in training other computer vision systems such as SEER, a self-supervised AI model that Facebook detailed last week, the company has purposely excluded user images from the European Union, likely because of GDPR.

Learning from Videos also encompasses Facebook's work on wav2vec 2.0, an improved machine learning framework for self-supervised speech recognition. The company says that when applied to millions of hours of unlabeled videos and 100 hours of labeled data, wav2vec 2.0 reduced the relative word error rate by 20% compared with supervised-only baselines. As a next step, Facebook says it's working to scale wav2vec 2.0 with millions of additional hours of speech from 25 languages to reduce labeling, bolster the performance of low- and medium-resource models, and improve other speech and audio tasks.
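
To make the "relative word error rate" figure concrete, here is a minimal sketch of how word error rate is conventionally computed (word-level edit distance divided by the number of reference words) and what a relative reduction means. The transcripts and the 10% baseline below are made-up numbers chosen for illustration. Facebook did not disclose absolute error rates in this announcement.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / words in reference."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1,                            # deletion
                           cur[j - 1] + 1,                         # insertion
                           prev[j - 1] + (ref_word != hyp_word)))  # substitution
        prev = cur
    return prev[-1] / len(ref)

def relative_reduction(baseline_wer, new_wer):
    """Fraction by which the error rate fell, relative to the baseline."""
    return (baseline_wer - new_wer) / baseline_wer

# One dropped word out of six reference words.
print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))

# Hypothetical numbers: a baseline WER of 10% falling to 8% is a 20%
# relative reduction, even though the absolute drop is 2 points.
print(relative_reduction(0.10, 0.08))
```

The distinction matters when reading the claim: a 20% relative improvement describes the change in proportion to the baseline, not a 20-point drop in the error rate itself.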

In a related effort, to make it easier to search across videos, Facebook says it's using a system called the Audio Visual Textual (AVT) model that aggregates and compares sound and visual information from videos as well as titles, captions, and descriptions. Given a command like "Show me every time we sang to Grandma," the AVT model can locate the matching moments and highlight the nearest timestamps in the video. Facebook says it's working to apply the model to millions of videos before it begins testing it across its platform. It's also adding speech recognition as one of the inputs to the AVT model, which will allow the system to respond to phrases like "Show me the news show that was talking about Yosemite."

TimeSformer

The Learning from Videos project also birthed TimeSformer, a Facebook-developed framework for video understanding that's based purely on the Transformer architecture. Transformers employ a trainable attention mechanism that specifies the dependencies between elements of each input sequence -- for instance, amino acids within a protein. It’s this that enables them to achieve state-of-the-art results in areas of machine learning including natural language processing, neural machine translation, document generation and summarization, and image and music generation.
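
For readers who want to see what an attention mechanism computes, the following is a minimal sketch of generic scaled dot-product attention in plain Python. It is not TimeSformer's implementation. In a real Transformer the queries, keys, and values come from trained projection matrices, whereas this example passes the toy input vectors through unchanged, and the function names are assumptions made for illustration.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention over a single sequence.

    Returns (outputs, weights). weights[i][j] is how strongly
    element i depends on element j.
    """
    dim = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim)
                  for k in keys]
        weights = softmax(scores)
        # Each output is a weighted average of the value vectors.
        output = [sum(w * v[c] for w, v in zip(weights, values))
                  for c in range(len(values[0]))]
        outputs.append(output)
        all_weights.append(weights)
    return outputs, all_weights

# Three toy sequence elements (hypothetical 2-D embeddings).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = attention(tokens, tokens, tokens)
for row in weights:
    print([round(w, 3) for w in row])
```

Each row of the printed weights sums to 1 and shows how much one element draws on every other element of the sequence, which is the "dependency" the paragraph above describes.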

Facebook claims that TimeSformer, short for Time-Space Transformer, attains the best reported numbers on a range of action recognition benchmarks. It also takes roughly one-third as long to train as comparable models, requires less than one-tenth the amount of compute for inference, and can learn from video clips up to 102 seconds in length, much longer than most video-analyzing AI models can handle. Facebook AI research scientist Lorenzo Torresani told VentureBeat that TimeSformer can be trained in 14 hours with 32 GPUs.

"Since TimeSformer specifically enables analysis of much longer videos, there’s also the opportunity for interesting future applications such as episodic memory retrieval -- ability to detect particular objects of interest that were seen by an agent in the past -- and classifying multi-step activities in real time like recognizing a recipe when someone is cooking with their AR glasses on," Torresani said. "Those are just a few examples of where we see this technology going in the future."

It's Facebook's assertion that systems like TimeSformer, GDT, wav2vec 2.0, and AVT will advance research to teach machines to understand long-form actions in videos, an important step for AI applications geared toward human understanding. The company also expects they'll form the foundation of applications that can comprehend what's happening in videos on a more granular level.

"[All] these models will be broadly applicable, but most are research for now. In the future, when applied in production, we believe they could do things like caption talks, speeches, and instructional videos; understand product mentions in videos; and search and classification of archives of recordings," Geoffrey Zweig, director at Facebook AI, told VentureBeat. "We are just starting to scratch the surface of self-supervised learning. There’s lots to do to build upon the models that we use, and we want to do so with speed and at scale for broad applicability."

Facebook chose not to respond directly to VentureBeat's question about how any bias in Learning from Videos models might be mitigated, instead saying: "In general, we have a cross-functional, multidisciplinary team dedicated to studying and advancing responsible AI and algorithmic fairness, and we’re committed to working toward the right approaches. We take this issue seriously, and have processes in place to ensure that we’re thinking carefully about the data that we use to train our models."

Research has shown that state-of-the-art image-classifying AI models trained on ImageNet, a popular (but problematic) dataset containing photos scraped from the internet, automatically learn humanlike biases about race, gender, weight, and more. Countless studies have demonstrated that facial recognition is susceptible to bias. It's even been shown that prejudices can creep into the AI tools used to create art, potentially contributing to false perceptions about social, cultural, and political aspects of the past and hindering awareness about important historical events.

Facebook chief AI scientist Yann LeCun recently admitted to Fortune that fully self-supervised computer vision systems can pick up the biases, including racial and gender stereotypes, inherent in the data. In acknowledgment of the problem, a year ago Facebook set up new teams to look for racial bias in the algorithms that drive its social network as well as Instagram. But a bombshell report in MIT Tech Review this week revealed that at least some of Facebook's internal efforts to mitigate bias were co-opted to protect growth or in anticipation of regulation. The report further alleges that the work of one division, Responsible AI, became essentially irrelevant to fixing the larger problems of misinformation, extremism, and political polarization.