AI and machine learning algorithms are becoming increasingly good at predicting next actions in videos. The very best can anticipate fairly accurately where a baseball might travel after it has been pitched, or the appearance of a road miles from a starting position. A novel approach proposed by researchers at Google, the University of Michigan, and Adobe advances the state of the art with large-scale models that generate high-quality videos from only a few frames. All the more impressive, it does so without relying on techniques like optical flow (the pattern of apparent motion of objects, surfaces, or edges in a scene) or landmarks, as previous methods have.
“In this work, we investigate whether we can achieve high-quality video predictions … by just maximizing the capacity of a standard neural network,” wrote the researchers in a preprint paper describing their work. “To the best of our knowledge, this work is the first to perform a thorough investigation on the effect of capacity increases for video prediction.”
The team’s baseline model builds on an existing stochastic video generation (SVG) architecture, with a component that models the inherent uncertainty in future predictions. They separately trained and tested several versions of the model against data sets tailored to three prediction categories: object interactions, structured motion, and partial observability. For the first task, object interactions, the researchers selected 256 videos from a corpus of videos of a robot arm interacting with towels, and for the second, structured motion, they sourced clips from Human 3.6M, a corpus containing clips of humans performing actions like sitting on a chair. As for the partial observability task, they used the open source KITTI driving data set, which comprises footage from a car’s front dashboard camera.
The team conditioned every model on two to five video frames and had the models predict five to 10 frames into the future during training. Training ran at low resolution (64 x 64 pixels) for all tasks and additionally at high resolution (128 x 128 pixels) for the object interactions task. During testing, the models generated up to 25 frames.
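To make the setup concrete, here is a minimal Python sketch of how a clip might be divided into conditioning frames and prediction targets. It is an illustration only, not the researchers' code: the function name `split_clip`, its parameters, and the use of integers as stand-ins for frames are all assumptions, and the article does not say how the frame counts were chosen for each data set.

```python
def split_clip(frames, n_context, n_future):
    """Split a clip into conditioning frames and the frames to predict.

    Hypothetical helper for illustration; not taken from the paper.
    """
    if n_context < 1 or n_future < 1:
        raise ValueError("need at least one context frame and one future frame")
    if n_context + n_future > len(frames):
        raise ValueError("clip is too short for the requested split")
    context = frames[:n_context]                       # frames the model sees
    targets = frames[n_context:n_context + n_future]   # frames it must predict
    return context, targets


# Stand-in for a 30-frame clip; real frames would be 64 x 64 or 128 x 128 images.
clip = list(range(30))

# Training-style split: condition on 2 frames, predict the next 10.
train_context, train_targets = split_clip(clip, n_context=2, n_future=10)

# Test-style split: same conditioning, but a longer 25-frame horizon.
test_context, test_targets = split_clip(clip, n_context=2, n_future=25)
```

The only difference between the training and testing calls is the prediction horizon, which reflects the article's point that the models were asked to generate more frames at test time than they were trained to predict.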
The researchers report that one of the largest models was preferred 90.2%, 98.7%, and 99.3% of the time with respect to the object interactions, structured motion, and partial observability tasks, respectively, by evaluators recruited through Amazon Mechanical Turk. Qualitatively, the team notes that the model crisply depicted human arms and legs and made “very sharp predictions that looked realistic in comparison to the ground truth.”
“Our experiments confirm the importance of recurrent connections and modeling stochasticity [or randomness] in the presence of uncertainty (e.g., videos with unknown action or control),” wrote the paper’s coauthors. “We also find that maximizing the capacity of such models improves the quality of video prediction. We hope our work encourages the field to push along similar directions in the future — i.e., to see how far we can get … for achieving high-quality video prediction.”