Google's AI learns how actions in videos are connected

AI systems have become quite competent at recognizing objects (and actions) in videos from diverse sources. But they aren't perfect, in part because they're mostly trained on corpora containing clips with single labels. Frame-by-frame tracking isn't a particularly efficient solution because it would require that annotators apply labels to every frame in each video, and because "teaching" a model to recognize an action it hadn't seen before would necessitate labeling new clips from scratch.

That's why scientists at Google propose Temporal Cycle-Consistency Learning (TCC), a self-supervised AI training technique that taps "correspondences" between examples of similar sequential processes (like weight-lifting repetitions or baseball pitches) to learn representations well-suited for temporal video understanding. The codebase is open source and available on GitHub.

As the researchers explain, footage that captures certain actions contains key common moments -- or correspondences -- that exist independent of factors like viewpoint changes, scale, container style, or the speed of the event. TCC attempts to find such correspondences across videos by leveraging cycle-consistency.

First, a training algorithm produces embeddings (mathematical representations) of video frames by ingesting each frame individually. Two videos are then selected for TCC learning, and the embedding of a reference frame chosen from one of them is used to identify its nearest-neighbor frame in the second video. The process then cycles back: the nearest neighbor of that second frame is looked up in the first video, and a check confirms that this last frame is the original reference frame. Over the course of training, the embedder develops a semantic understanding of each video frame in the context of the action being performed.
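
To make the cycle concrete, here is a minimal Python sketch of the check on toy data. It is an illustration only, not the code from Google's repository: the function names (`nearest_index`, `is_cycle_consistent`, `cycle_consistency_rate`) are hypothetical, the two-dimensional "embeddings" are made up, and it uses a hard nearest-neighbor lookup with Euclidean distance, which is an assumption made here for readability. A trainable system would need a differentiable version of this lookup so the embedder can learn from it.

```python
import math


def nearest_index(query, candidates):
    """Index of the candidate embedding closest to `query` (Euclidean distance)."""
    return min(range(len(candidates)), key=lambda j: math.dist(query, candidates[j]))


def is_cycle_consistent(video_a, video_b, i):
    """Go from frame i of video_a to its nearest neighbor in video_b,
    then back to video_a, and report whether we land on frame i again."""
    j = nearest_index(video_a[i], video_b)   # forward step: A -> B
    k = nearest_index(video_b[j], video_a)   # backward step: B -> A
    return k == i


def cycle_consistency_rate(video_a, video_b):
    """Fraction of frames in video_a that survive the round trip."""
    hits = sum(is_cycle_consistent(video_a, video_b, i) for i in range(len(video_a)))
    return hits / len(video_a)


# Toy per-frame embeddings (hypothetical values): the same action at two speeds.
video_a = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]]
video_b = [[0.1, 0.0], [0.9, 0.1], [1.1, 0.0], [2.1, 0.1]]
print(cycle_consistency_rate(video_a, video_b))
```

In this sketch every frame of `video_a` returns to itself after the round trip, so the embeddings would count as well aligned. Frames that fail the check are the ones a training procedure would try to pull into agreement.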

The researchers say that TCC can be used to classify the phases of different actions with as few as a single labeled video, and that it can align many clips at once by selecting the nearest neighbor to each frame in a reference video. Moreover, they say that it can transfer metadata (like temporal semantic labels, sound, or text) associated with any frame in one video to its matching frame in another video, and that each frame in a given video could be used to retrieve similar frames by looking up the nearest neighbors in the embedding space.
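
The alignment and metadata-transfer ideas both reduce to nearest-neighbor lookups in the embedding space, which can be sketched in a few lines of Python. As before, this is a hedged illustration under stated assumptions rather than the researchers' implementation: `align_to_reference` and `transfer_labels` are hypothetical names, the embeddings and phase labels are invented toy values, and Euclidean distance is assumed as the similarity measure.

```python
import math


def nearest_index(query, candidates):
    """Index of the candidate embedding closest to `query` (Euclidean distance)."""
    return min(range(len(candidates)), key=lambda j: math.dist(query, candidates[j]))


def align_to_reference(reference, other):
    """For each reference frame, the index of the matching frame in `other`."""
    return [nearest_index(frame, other) for frame in reference]


def transfer_labels(labeled_video, labels, target_video):
    """Give each target frame the label of its nearest frame in the labeled video."""
    return [labels[nearest_index(frame, labeled_video)] for frame in target_video]


# One labeled video (hypothetical embeddings and phase names).
labeled_video = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]]
labels = ["start", "pour", "end"]

# An unlabeled video of the same action, embedded by the same model.
target_video = [[0.1, 0.0], [0.2, 0.0], [1.2, 0.0], [1.9, 0.0]]

print(align_to_reference(labeled_video, target_video))
print(transfer_labels(labeled_video, labels, target_video))
```

The same lookup would work for other per-frame metadata, such as a sound clip or a caption, by swapping the `labels` list for whatever is attached to each labeled frame.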

In one experiment, the researchers report that a supervised learning approach without TCC required about 50 videos, each with every frame labeled, to achieve the same accuracy that a self-supervised TCC method managed with just one fully labeled video. In another, the team successfully transferred the sound of liquid being poured into a cup from one video to another.

"This ... will be useful for researchers working on video understanding, as well as artists looking to use machine learning to align videos to create mosaics of people, animals, and objects moving synchronously," wrote Google Research research associate Debidatta Dwibedi.