Generating videos from whole cloth isn’t anything new for neural networks, which are layers of mathematical functions loosely modeled after biological neurons. In fact, researchers last week described a machine learning system capable of hallucinating clips from start and end frames alone. But because of the inherent randomness, complexity, and information density of videos, modeling realistic clips at scale remains something of a grand challenge for AI.
A team of scientists at Google Research, however, say they’ve made progress with novel networks that are able to produce “diverse” and “surprisingly realistic” frames from open source video data sets at scale. They describe their method in a newly published paper on the preprint server arXiv.org (“Scaling Autoregressive Video Models”), and on a webpage containing selected samples of the model’s outputs.
“[We] find that our [AI] models are capable of producing diverse and surprisingly realistic continuations on a subset of videos from Kinetics, a large scale action recognition data set of … videos exhibiting phenomena such as camera movement, complex object interactions, and diverse human movement,” wrote the coauthors. “To our knowledge, this is the first promising application of video-generation models to videos of this complexity.”
The researchers’ systems are autoregressive, meaning they generate videos pixel by pixel, with each new pixel conditioned on the pixels produced before it. They’re built upon a generalization of Transformers, a type of neural architecture introduced in a 2017 paper (“Attention Is All You Need”) coauthored by scientists at Google Brain, Google’s AI research division. As with all deep neural networks, Transformers contain neurons (functions) that transmit “signals” from input data and slowly adjust the synaptic strength (the weights) of each connection. (That’s how the model extracts features and learns to make predictions.)
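To make “pixel by pixel” concrete, here is a minimal sketch of an autoregressive sampling loop. It is not the researchers’ model: `toy_next_pixel_probs` is a hypothetical stand-in for a trained network, and the four intensity levels and the “repeat the last value” behavior are assumptions chosen only to keep the example small.

```python
import random

def toy_next_pixel_probs(prefix, levels=4):
    """Hypothetical stand-in for a trained network.

    Returns a probability for each intensity level of the next pixel,
    given the pixels generated so far. It simply favors repeating the
    most recent value.
    """
    if not prefix:
        return [1.0 / levels] * levels
    probs = [0.1 / (levels - 1)] * levels
    probs[prefix[-1]] = 0.9
    return probs

def generate_autoregressively(num_pixels, next_probs=toy_next_pixel_probs,
                              levels=4, seed=0):
    """Sample pixels one at a time, each conditioned on all earlier ones."""
    rng = random.Random(seed)
    pixels = []
    for _ in range(num_pixels):
        probs = next_probs(pixels, levels)
        value = rng.choices(range(levels), weights=probs, k=1)[0]
        pixels.append(value)  # the new pixel becomes context for the next
    return pixels

if __name__ == "__main__":
    print(generate_autoregressively(16))
```

The point of the loop is the dependency structure: nothing is predicted in parallel, and every sampled value feeds back into the next prediction.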
Uniquely, Transformers have attention, such that every output element is connected to every input element and the weightings between them are calculated dynamically. It’s this property that enables the video-generating systems to efficiently model clips as 3D volumes — rather than sequences of still frames — and drives direct interactions between representations of the videos’ pixels across dimensions.
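For readers who want to see what “weightings calculated dynamically” means, the following is a small sketch of scaled dot-product attention as described in the 2017 Transformer paper. It is written in plain Python for readability and is not the video model’s specific attention variant; the function names and the toy vectors are illustrative assumptions.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention over lists of vectors.

    Each output is a weighted average of all value vectors, and the
    weights are computed from the inputs themselves rather than stored
    as fixed parameters.
    """
    dim = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim)
                  for k in keys]
        weights = softmax(scores)
        out = [sum(w * v[j] for w, v in zip(weights, values))
               for j in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

if __name__ == "__main__":
    vectors = [[1.0, 0.0], [0.0, 1.0]]
    outputs, weights = attention(vectors, vectors, vectors)
    print(weights)
```

Because every query is scored against every key, each output element can draw on every input element, which is the property the article describes.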
To maintain a manageable memory footprint and create an architecture suited to tensor processing units (TPUs), Google’s custom-designed AI accelerator chips, the researchers combined the Transformer-derived architecture with approaches that generate images as sequences of smaller, sub-scaled image slices. Their models produce “slices” (sub-sampled lower-resolution videos) by processing partially masked video input data with an encoder, the output of which is used as conditioning for decoding the current video slice. After a slice is generated, the padding in the video is replaced with the generated output and the process is repeated for the next slice.
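The slice-by-slice procedure can be sketched roughly as follows. This is an illustration of the bookkeeping only, under several assumptions: the video is a nested list indexed as frame, row, column; a slice is taken by striding through all three axes from a given offset; and the “generation” step simply copies ground-truth pixels where a real system would run its encoder and decoder. All function names here are hypothetical.

```python
from itertools import product

def slice_offsets(strides):
    """All starting offsets, one per sub-sampled slice."""
    return list(product(*(range(s) for s in strides)))

def extract_slice(video, offset, strides):
    """Take every st-th frame, sh-th row, and sw-th column from an offset."""
    (a, b, c), (st, sh, sw) = offset, strides
    return [[row[c::sw] for row in frame[b::sh]] for frame in video[a::st]]

def write_slice(canvas, piece, offset, strides):
    """Replace padding in the canvas with the pixels of one slice."""
    (a, b, c), (st, sh, sw) = offset, strides
    for i, frame in enumerate(piece):
        for j, row in enumerate(frame):
            for k, value in enumerate(row):
                canvas[a + i * st][b + j * sh][c + k * sw] = value

def rebuild_slice_by_slice(video, strides, pad=None):
    """Fill a fully padded canvas one slice at a time."""
    frames, height, width = len(video), len(video[0]), len(video[0][0])
    canvas = [[[pad] * width for _ in range(height)] for _ in range(frames)]
    for offset in slice_offsets(strides):
        # Assumption: a real model would decode this slice conditioned on
        # an encoding of the partially filled canvas. We copy instead.
        piece = extract_slice(video, offset, strides)
        write_slice(canvas, piece, offset, strides)
    return canvas

if __name__ == "__main__":
    video = [[[f * 16 + r * 4 + c for c in range(4)] for r in range(4)]
             for f in range(4)]
    print(rebuild_slice_by_slice(video, (2, 2, 2)) == video)
```

With strides of 2 along each axis, a 4×4×4 clip splits into eight interleaved 2×2×2 slices, so each decoding step only has to handle a fraction of the full volume.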
In experiments, the team modeled slices of four frames, first training their AI systems on video from the BAIR Robot Pushing data set, which consists of roughly 40,000 training videos and 256 test videos showing a robotic arm pushing and grasping objects in a box. Next they applied the models to down-sampled videos from the Kinetics-600 data set, a large-scale action-recognition corpus containing about 400,000 YouTube videos across 600 action classes.
Smaller models were trained for 300,000 steps, while larger ones were trained for 1 million steps.
The qualitative results were good: the team reports seeing “highly encouraging” generated videos for limited subsets such as cooking videos, which they note cover diverse subjects and feature camera movement and complex object interactions like steam and fire. “This marks a departure from the often very narrow domains discussed in the video generation literature to date, such as artificially generated videos of moving digits or shapes,” wrote the researchers, “or videos depicting natural, yet highly constrained environments, such as robot arms interacting with a small number of different objects with a fixed camera angle and background.”
They concede that the models struggle with nuanced elements, such as the motion of human fingers and faces, and point out the many failure modes in the output data, which range from frozen movement and object distortions to continuations that break down after a few frames. Those shortcomings aside, they claim state-of-the-art results in video generation and believe they have demonstrated an aptitude for modeling clips of an “unprecedented” complexity.