Recognizing activities and anticipating which might come next is easy enough for humans, who make such predictions subconsciously all the time. But machines have a tougher go of it, particularly where there’s a relative dearth of labeled data. (Action-classifying AI systems typically train on annotations paired with video samples.) That’s why a team of Google researchers proposes VideoBERT, a self-supervised system that tackles various proxy tasks to learn temporal representations from unlabeled videos.

As the researchers explain in a paper and accompanying blog post, VideoBERT’s goal is to discover high-level audio and visual semantic features corresponding to events and actions unfolding over time. “[S]peech tends to be temporally aligned with the visual signals [in videos], and can be extracted by using off-the-shelf automatic speech recognition (ASR) systems,” said Google research scientists Chen Sun and Cordelia Schmid. “[It] thus provides a natural source of self-supervision.”

To define tasks that would lead the model to learn the key characteristics of activities, the team tapped Google’s BERT, a natural language AI system designed to model relationships among sentences. Specifically, they combined image frames with the sentence outputs of a speech recognition system, converting the frames into 1.5-second visual tokens based on feature similarity and concatenating those tokens with word tokens. They then tasked VideoBERT with filling in the missing tokens in the resulting visual-text sentences.
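The token pipeline described above can be sketched roughly as follows. This is a minimal illustration rather than the researchers' implementation: it assumes each 1.5-second clip has already been reduced to a feature vector, stands in nearest-centroid assignment for the "feature similarity" step, and uses hypothetical names throughout (`to_visual_tokens`, `build_sequence`, `mask_tokens`, the `vis_N` token format, and the `[>]` separator).

```python
import random

# Hypothetical special tokens, chosen for illustration only.
CLS, SEP, JOIN, MASK = "[CLS]", "[SEP]", "[>]", "[MASK]"
SPECIAL = {CLS, SEP, JOIN}


def nearest_centroid(feature, centroids):
    """Return the index of the centroid closest to `feature` (squared Euclidean)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(centroids)), key=lambda i: sq_dist(feature, centroids[i]))


def to_visual_tokens(clip_features, centroids):
    """Map each clip's feature vector to a discrete visual token such as 'vis_3'."""
    return [f"vis_{nearest_centroid(f, centroids)}" for f in clip_features]


def build_sequence(words, visual_tokens):
    """Concatenate ASR word tokens and visual tokens into one 'sentence'."""
    return [CLS] + list(words) + [JOIN] + list(visual_tokens) + [SEP]


def mask_tokens(sequence, mask_prob, rng):
    """Hide some non-special tokens; return the masked sequence and the targets."""
    masked, targets = [], {}
    for i, tok in enumerate(sequence):
        if tok not in SPECIAL and rng.random() < mask_prob:
            masked.append(MASK)
            targets[i] = tok  # what the model would be asked to fill in
        else:
            masked.append(tok)
    return masked, targets


# Toy data: three 2-D clip features and three assumed cluster centroids.
centroids = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
clip_features = [[0.1, -0.2], [0.9, 1.2], [4.8, 5.1]]
words = ["stir", "the", "batter"]

visual = to_visual_tokens(clip_features, centroids)
sequence = build_sequence(words, visual)
masked, targets = mask_tokens(sequence, mask_prob=0.3, rng=random.Random(0))
```

A model trained this way would receive `masked` as input and be scored on how well it recovers the entries of `targets`, whether they are words or visual tokens.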

Above: Action anticipation accuracy with the CBT approach from untrimmed videos with 200 activity classes.

The researchers trained VideoBERT on over one million instructional videos across categories like cooking, gardening, and vehicle repair. To verify that it learned semantic correspondences between videos and text, the team tested its accuracy on a cooking video dataset in which neither the videos nor the annotations were used during pre-training. The results show that VideoBERT successfully predicted, for example, that a bowl of flour and cocoa powder may become a brownie or cupcake after baking in an oven. It also generated sets of instructions (such as a recipe) from a video, along with video segments (tokens) reflecting what’s described at each step.

That said, VideoBERT’s visual tokens tend to lose fine-grained visual information, such as smaller objects and subtle motions. The team addressed this with a model they call Contrastive Bidirectional Transformers (CBT), which removes the tokenization step. Evaluated on a range of data sets covering action segmentation, action anticipation, and video captioning, CBT reportedly outperformed the state of the art by “significant margins” on most benchmarks.
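The article does not spell out how CBT is trained, so the following is only a generic illustration of what a contrastive objective over continuous features can look like once discrete tokens are gone. It assumes an InfoNCE-style loss in which each predicted feature is scored against every candidate target by dot product; the function name `info_nce_loss`, the `temperature` parameter, and the toy vectors are all assumptions, not details from the paper.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def info_nce_loss(predicted, targets, temperature=1.0):
    """Average contrastive loss over positions.

    predicted[i] should score highest against targets[i]; every other
    targets[j] acts as a negative. Assumed formulation for illustration.
    """
    total = 0.0
    for i, p in enumerate(predicted):
        logits = [dot(p, t) / temperature for t in targets]
        # Numerically stable log-sum-exp over all candidates.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denominator - logits[i]  # -log softmax of the positive
    return total / len(predicted)


# Toy check: matching features give a lower loss than mismatched ones.
features = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = info_nce_loss(features, features)
shuffled_loss = info_nce_loss(features, list(reversed(features)))
```

Because the loss compares real-valued vectors directly, nothing has to be rounded to the nearest cluster, which is one way to see why dropping tokenization could preserve finer visual detail.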


Above: Results from VideoBERT, pretrained on cooking videos

Image Credit: Google

The researchers leave to future work learning low-level visual features jointly with long-term temporal representations, which they say might enable better adaptation to video context. Additionally, they plan to pre-train on a larger and more diverse set of videos.

“Our results demonstrate the power of the BERT model for learning visual-linguistic and visual representations from unlabeled videos,” wrote the researchers. “We find that our models are not only useful for … classification and recipe generation, but the learned temporal representations also transfer well to various downstream tasks, such as action anticipation.”
