Imagine this: You’re given the start and end of a video and tasked with sketching out the intervening frames, inferring what you can from the limited information on hand in order to fill the gap. Could you? It might sound like an impossible task, but researchers at Google’s AI research division have developed a novel system that can generate “plausible” video sequences from no more than a single first and final frame, a process known as “inbetweening.”
They describe their work in a newly published paper (“From Here to There: Video Inbetweening Using Direct 3D Convolutions”) on the preprint server arXiv.org.
“Imagine if we could teach an intelligent system to automatically turn comic books into animations. Being able to do so would undoubtedly revolutionize the animation industry,” wrote the paper’s coauthors. “Although such an immensely labor-saving capability is still beyond the current state-of-the-art, advances in computer vision and machine learning are making it an increasingly more tangible goal.”
The AI system comprises a fully convolutional model — a class of deep neural networks inspired by the animal visual cortex that’s most commonly applied to analyzing visual imagery — with three components: a 2D-convolutional image encoder, a 3D-convolutional latent representation generator, and a video generator. The image encoder maps frames from target videos to a latent space, while the latent representation generator learns to incorporate the information contained in the input frames. Finally, the video generator decodes the latent representation into video frames.
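The data flow through those three components can be sketched at the level of tensor shapes. In the minimal sketch below, the layer internals (average pooling, linear interpolation over time, nearest-neighbor upsampling) are illustrative stand-ins for the paper’s learned 2D and 3D convolutions — the point is only how two 64 × 64 frames become a 16-frame latent sequence and then a full video.

```python
import numpy as np

def encode_frame(frame, factor=8):
    """Image-encoder stand-in: downsample a (H, W, C) frame to a latent map.

    The paper uses a learned 2D-convolutional encoder; average pooling is an
    illustrative substitute that produces the same kind of spatial reduction.
    """
    h, w, c = frame.shape
    return frame.reshape(h // factor, factor, w // factor, factor, c).mean(axis=(1, 3))

def generate_latent_video(z_start, z_end, num_frames=16):
    """Latent-generator stand-in: fuse the two frame representations into a
    latent sequence. Here this is plain linear interpolation over time; the
    paper instead uses a learned 3D-convolutional generator.
    """
    ts = np.linspace(0.0, 1.0, num_frames)[:, None, None, None]
    return (1 - ts) * z_start[None] + ts * z_end[None]

def decode_video(latents, factor=8):
    """Video-generator stand-in: upsample each latent map back to frame size."""
    return latents.repeat(factor, axis=1).repeat(factor, axis=2)

start = np.random.rand(64, 64, 3)  # first frame
end = np.random.rand(64, 64, 3)    # final frame

z0, z1 = encode_frame(start), encode_frame(end)  # each (8, 8, 3)
latents = generate_latent_video(z0, z1)          # (16, 8, 8, 3)
video = decode_video(latents)                    # (16, 64, 64, 3)
print(video.shape)
```

Note that the first and last latent frames coincide exactly with the encoded start and end frames, mirroring the constraint that the generated video must be consistent with the given endpoints.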
The researchers say that separating the latent representation generation from video decoding was of “crucial importance” to successfully achieving video inbetweening, and that their attempts to generate videos directly from the encoded representations of the start and end frames ended poorly. To address this, they designed the latent representation generator to fuse frame representations and progressively increase the generated video’s resolution.
To validate their approach, the researchers sourced videos from three datasets — BAIR robot pushing, KTH Action Database, and UCF101 Action Recognition Data Set — and downsampled them to a resolution of 64 × 64 pixels. Each sample contained 16 frames in total, 14 of which the AI system was tasked with generating. The researchers ran the model 100 times for each pair of video frames and repeated the process 10 times for each model variant and dataset. (Training took around 5 days on an Nvidia Tesla V100 graphics card.)
The result? The AI-generated sequences were similar in style and consistent with the given start and end frames, the researchers report, and moreover both “meaningful” and diverse. “The rather surprising fact that video inbetweening can be achieved over such a long time base,” wrote the team, “[may] provide a useful alternative perspective for future research on video generation.”