DeepMind's AI learns to generate realistic videos by watching YouTube clips

Perhaps you've heard of FaceApp, the mobile app that taps AI to transform selfies, or This Person Does Not Exist, which surfaces computer-generated photos of fictional people. But what about an algorithm whose videos are wholly novel? One of the newest papers from Google parent company Alphabet's DeepMind ("Efficient Video Generation on Complex Datasets") details recent advances in the budding field of AI clip generation. Thanks to "computationally efficient" components and techniques and a new custom-tailored data set, researchers say their best-performing model -- Dual Video Discriminator GAN (DVD-GAN) -- can generate coherent 256 x 256-pixel videos of "notable fidelity" up to 48 frames in length.

"Generation of natural video is an obvious further challenge for generative modeling, but one that is plagued by increased data complexity and computational requirements," wrote the coauthors. "For this reason, much prior work on video generation has revolved around relatively simple data sets, or tasks where strong temporal conditioning information is available. We focus on the tasks of video synthesis and video prediction ... and aim to extend the strong results of generative image models to the video domain."

The team built their system around a cutting-edge AI architecture and introduced video-specific tweaks that enabled it to train on Kinetics-600, a data set of natural videos "an order of magnitude" larger than commonly used corpora. Specifically, the researchers leveraged scaled-up generative adversarial networks, or GANs. These are two-part AI systems consisting of generators that produce samples and discriminators that attempt to distinguish between the generated samples and real-world samples, and they have historically been applied to tasks like converting captions to scene-by-scene storyboards and generating images of artificial galaxies. The variant used here was BigGAN, which is distinguished by its large batch sizes and millions of parameters.
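To make the generator-versus-discriminator tug of war concrete, here is a minimal sketch of an adversarial objective in its hinge form, the formulation used in the BigGAN paper. It is an assumption that DVD-GAN trains with this loss, and the function names are hypothetical. The scores stand in for a discriminator's raw outputs, where higher means "looks real."

```python
import numpy as np

def discriminator_hinge_loss(real_scores, fake_scores):
    """Penalize real samples scored below +1 and generated samples scored above -1."""
    real_scores = np.asarray(real_scores, dtype=float)
    fake_scores = np.asarray(fake_scores, dtype=float)
    loss_real = np.maximum(0.0, 1.0 - real_scores).mean()
    loss_fake = np.maximum(0.0, 1.0 + fake_scores).mean()
    return loss_real + loss_fake

def generator_hinge_loss(fake_scores):
    """The generator is rewarded when the discriminator scores its samples highly."""
    return -np.asarray(fake_scores, dtype=float).mean()

# Illustrative scores only; a real system would get these from a neural network.
print(discriminator_hinge_loss([2.0, 1.5], [-2.0, -1.0]))
print(generator_hinge_loss([-2.0, -1.0]))
```

The two losses pull in opposite directions: the discriminator wants a wide gap between real and generated scores, while the generator wants its samples scored as high as possible.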

DVD-GAN contains dual discriminators: a spatial discriminator that critiques a single frame's content and structure by randomly sampling full-resolution frames and processing them individually, and a temporal discriminator that provides a learning signal to generate movement. A separate module -- a Transformer -- allowed learned information to propagate across the entire AI model.
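The split between the two discriminators can be sketched in terms of what each one gets to look at. The snippet below is an illustration under stated assumptions, not the researchers' code: the function names, the choice of 8 sampled frames, sampling without replacement, and the 2 x 2 average pooling applied to the temporal input are all illustrative choices that the article does not specify.

```python
import numpy as np

def spatial_discriminator_input(video, k, rng):
    """Pick k distinct full-resolution frames at random; each is judged on its own."""
    num_frames = video.shape[0]
    idx = rng.choice(num_frames, size=k, replace=False)
    return video[np.sort(idx)]

def temporal_discriminator_input(video, factor=2):
    """Keep every frame, but shrink each one by average pooling (an assumption)."""
    t, h, w, c = video.shape
    assert h % factor == 0 and w % factor == 0
    # Group pixels into factor x factor tiles, then average within each tile.
    tiles = video.reshape(t, h // factor, factor, w // factor, factor, c)
    return tiles.mean(axis=(2, 4))

rng = np.random.default_rng(0)
# A stand-in "video": 48 frames of 64 x 64 RGB noise (small, to keep this light).
video = rng.random((48, 64, 64, 3))

frames = spatial_discriminator_input(video, k=8, rng=rng)
clip = temporal_discriminator_input(video)
print(frames.shape)  # (8, 64, 64, 3)
print(clip.shape)    # (48, 32, 32, 3)
```

The design trade-off is cost: one critic sees full detail in only a handful of frames, and the other sees every frame at reduced detail, so neither has to process the whole video at full resolution.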

As for the training data set (Kinetics-600), which was compiled from 500,000 10-second high-resolution YouTube clips originally curated for human action recognition, the researchers describe it as "diverse" and "unconstrained," which they claim obviated concerns about overfitting. (In machine learning, overfitting refers to models that correspond too closely to a particular set of data and as a result fail to predict future observations reliably.)

The team reports that after being trained on Google's AI-accelerating third-generation Tensor Processing Units for between 12 and 96 hours, DVD-GAN managed to create videos with object composition, movement, and even complicated textures like the side of an ice rink. It struggled to create coherent objects at higher resolutions, where movement spans a much larger number of pixels. Still, the researchers note that when evaluated on UCF-101 (a smaller data set of 13,320 videos of human actions), DVD-GAN produced samples with a state-of-the-art Inception Score of 32.97.
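For readers unfamiliar with the metric, the Inception Score is conventionally defined as exp(E_x[KL(p(y|x) || p(y))]), where p(y|x) is a pretrained classifier's label distribution for a generated sample and p(y) is the average of those distributions over all samples. Higher is better: it rewards samples that are individually easy to classify and collectively varied. The sketch below computes that formula from an array of class probabilities. The function name and the toy inputs are hypothetical, and in a real evaluation the probabilities would come from a pretrained classifier rather than being made up.

```python
import numpy as np

def inception_score(probs, eps=1e-12):
    """probs: array of shape (num_samples, num_classes); each row sums to 1."""
    probs = np.asarray(probs, dtype=float)
    marginal = probs.mean(axis=0, keepdims=True)  # p(y), averaged over samples
    # KL divergence between each sample's p(y|x) and the marginal p(y).
    kl = (probs * (np.log(probs + eps) - np.log(marginal + eps))).sum(axis=1)
    return float(np.exp(kl.mean()))

# Every sample gets the same uncertain prediction: the lowest possible score.
uncertain = np.full((4, 3), 1.0 / 3.0)
print(inception_score(uncertain))  # 1.0

# Each sample is confidently a different class: the score approaches the class count.
confident = np.eye(3)
print(inception_score(confident))  # about 3.0
```

Because the score is capped by the number of classes, the figures reported on a 101-class data set like UCF-101 can't be compared directly with scores from data sets that have a different number of categories.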

"We further wish to emphasize the benefit of training generative models on large and complex video data sets, such as Kinetics-600," wrote the coauthors. "We envisage the strong baselines we established on this data set with DVD-GAN will be used as a reference point by the generative modeling community moving forward. While much remains to be done before realistic videos can be consistently generated in an unconstrained setting, we believe DVD-GAN is a step in that direction."
