Google's Imagen takes on Meta's Make-A-Video as text-to-video AI models ramp up

’Tis the season of generative artificial intelligence (AI). Last week, Meta announced Make-A-Video, an AI system that allows users to turn text prompts into short, high-quality, one-of-a-kind video clips. Now, Google isn’t far behind. The text-to-video trend shows all the signs of getting ready to explode, much like text-to-image did over the past year with DALL-E, MidJourney and Stable Diffusion.

Announced just yesterday, Google’s Imagen Video is a text-to-video generative AI model capable of producing high-definition videos from a text prompt. The text-conditioned video diffusion model can generate videos up to a resolution of 1280×768 at 24 frames per second.

Google’s Imagen Video offers high fidelity

In its recently published paper, “Imagen Video: High definition video generation with diffusion models,” Google claims Imagen Video can generate high-fidelity videos and has a high degree of controllability and world knowledge. The generative model’s capabilities include creating diverse videos and text animations in different artistic styles, understanding 3D structure, and rendering and animating text. The model is currently in a research phase, but its arrival just five months after Imagen points to the rapid development of synthesis-based models.

A look at Imagen Video.

Imagen Video consists of a text encoder (frozen T5-XXL), a base video diffusion model, and interleaved spatial and temporal super-resolution diffusion models. Google says it built this architecture by transferring findings from its previous work on diffusion-based image generation to the video generation setting. The research team also incorporated progressive distillation with classifier-free guidance into the video models for fast, high-quality sampling.
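For readers unfamiliar with classifier-free guidance, the sketch below shows the commonly used formulation, in which a model's text-conditioned and unconditioned noise predictions are blended with a guidance weight. This is a minimal illustration and not Google's implementation: the function name, the toy values and the choice of weight are all assumptions made for the example.

```python
# Minimal sketch of classifier-free guidance (illustrative, not Google's code).
# eps_cond:   noise predicted with the text prompt (toy values, assumed)
# eps_uncond: noise predicted without the prompt (toy values, assumed)

def classifier_free_guidance(eps_cond, eps_uncond, guidance_weight):
    """Blend the two predictions: uncond + w * (cond - uncond).

    A weight of 1.0 returns the conditional prediction unchanged;
    larger weights push the sample further toward the text prompt.
    """
    return [u + guidance_weight * (c - u) for c, u in zip(eps_cond, eps_uncond)]


eps_cond = [0.5, -1.0, 2.0]
eps_uncond = [0.0, -0.5, 1.0]

guided = classifier_free_guidance(eps_cond, eps_uncond, guidance_weight=3.0)
print(guided)
```

In practice the predictions are large tensors rather than short lists, but the blending step has the same shape.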

Cascade of seven sub-video diffusion models

The video generation framework is a cascade of seven sub-video diffusion models that perform text-conditional video generation, spatial super-resolution, and temporal super-resolution. With the entire cascade, Imagen Video generates high-definition 1280×768 videos at 24 frames per second for 128 frames, or approximately 126 million pixels. With the help of progressive distillation, Imagen Video can generate high-quality videos using just eight diffusion steps per sub-model. This speeds up video generation by a factor of about 18.
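The figures above can be checked with a few lines of arithmetic. The sketch below uses only the numbers quoted in this article; the variable names are illustrative, and the total step count simply multiplies steps by sub-models without accounting for the different cost of each model.

```python
# Back-of-the-envelope check of the figures quoted in the article.
WIDTH, HEIGHT = 1280, 768      # output resolution in pixels
FRAMES = 128                   # frames per generated clip
FPS = 24                       # frames per second
SUB_MODELS = 7                 # diffusion models in the cascade
STEPS_PER_MODEL = 8            # sampling steps after progressive distillation

total_pixels = WIDTH * HEIGHT * FRAMES
clip_seconds = FRAMES / FPS
total_steps = SUB_MODELS * STEPS_PER_MODEL

print(f"{total_pixels:,} pixels (~{total_pixels / 1e6:.0f} million)")
print(f"{clip_seconds:.2f} seconds of video per clip")
print(f"{total_steps} sampling steps across the cascade")
```

At 24 frames per second, 128 frames works out to a clip a little over five seconds long.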

Comparison of progressively increasing resolutions generated by the spatial architecture at 200k training steps.

The model’s several notable stylistic abilities also include generating videos based on the work of renowned painters like Vincent van Gogh, rendering rotating objects in 3D while preserving their structure and rendering text in various animation styles.

Google says that Imagen Video was trained on the publicly available LAION-400M image-text dataset, as well as 14 million video-text pairs and 60 million image-text pairs. The training datasets allowed it to generalize to a variety of aesthetics. In addition, Google’s development team found that a benefit of cascading models is that each diffusion model can be trained independently, which makes it possible to train all seven models in parallel.

A Google data dilemma

As generative models may also be misused to generate fake, hateful, explicit or harmful content, Google claims that it has taken multiple steps to minimize such concerns. Through internal trials, the company affirmed that it was able to apply input text prompt filtering and output video content filtering, but warned that there are still several important safety and ethical challenges that must be worked through.

Imagen Video and its frozen T5-XXL text encoder were trained on “problematic data.” While internal testing shows that much of the explicit and violent content can be filtered out, Google says that social biases and stereotypes still exist which can be challenging to detect and filter. This was a major reason that Google decided not to release the model or its source code publicly until the concerns are mitigated.

Generative AI at Google and beyond?

According to Dumitru Erhan, a staff research scientist at Google Brain, efforts are underway to strengthen the research behind Phenaki, another Google text-to-video system that can turn detailed text prompts into videos more than two minutes long. Its main drawback is lower video quality.

The team working on Phenaki says that the model can take advantage of the vast text-image datasets to generate videos, where the user can also narrate and dynamically change scenes.

The generative AI trend that started with text-to-image and has begun to move to text-to-video also seems to be slowly shifting toward text-to-3D, with models such as CLIP-Forge, a text-to-shape generation model that can generate 3D objects using zero-shot learning.

Google’s very own text-to-3D AI “DreamFusion”, released last week, is another prime example of generative AI moving towards a more aggressive 3D synthesis approach. DreamFusion utilizes Imagen to optimize a 3D scene.
