Meta's new Make-A-Video signals the next generative AI evolution

This morning Meta CEO Mark Zuckerberg posted on his Facebook page to announce Make-A-Video, a new AI system that allows users to turn text prompts, like "a teddy bear painting a self-portrait," into short, high-quality, one-of-a-kind video clips.

Sound like DALL-E? That's the idea: According to a press release, Make-A-Video builds on AI image generation technology (including Meta's Make-A-Scene work from earlier this year) by "adding a layer of unsupervised learning that allows the system to understand motion in the physical world and apply it to traditional text-to-image generation."

"This is pretty amazing progress," Zuckerberg wrote in his post. "It's much harder to generate video than photos because beyond correctly generating each pixel, the system also has to predict how they'll change over time."

A year after DALL-E

It's hard to believe that it has been only about a year since the original DALL-E was unveiled in January 2021. Since then, 2022 has seemed to be the year of the text-to-image revolution, thanks to DALL-E 2, Midjourney, Stable Diffusion and other large generative models that allow users to create realistic images and art from natural text prompts.

Is Meta's new Make-A-Video a sign that the next step of generative AI, text-to-video, is about to go mainstream? Given the sheer speed of text-to-image evolution this year -- Midjourney even created controversy with an image that won an art competition at the Colorado State Fair -- it certainly seems possible. A couple of weeks ago, video editing software company Runway released a promotional video teasing a new feature of its AI-powered web-based video editor that can edit video from written descriptions.

And the demand for text-to-video generators at the level of today's text-to-image options is high, thanks to the need for video content across all channels -- from social media advertising and video blogs to explainer videos.

Meta, for its part, seems confident, according to its research paper introducing Make-A-Video: "In all aspects, spatial and temporal resolution, faithfulness to text, and quality, we present state-of-the-art results in text-to-video generation, as determined by both qualitative and quantitative measures."
