How diffusion models unlock new possibilities for generative creativity

Generative artificial intelligence (AI) models continue to gain popularity and recognition. The technology’s recent advancement and success in the image-generation domain have created a wave of interest among tech companies and machine learning (ML) practitioners, who are now steadily adopting generative AI models for several business use cases.

The emergence of text-to-image and text-to-video architectures is fueling this adoption further, with generative AI models such as Google’s Imagen Video, Meta’s Make-A-Video and others like DALL-E, MidJourney and Stable Diffusion.

A common denominator among these generative AI architectures is a method known as the diffusion model, which takes inspiration from the physical process of gas diffusion, in which molecules spread from high-density to low-density areas.

Loosely mirroring that physical process in reverse, the model starts from random noise and removes it over a series of steps until an aesthetically pleasing and ideally coherent image emerges. By guiding noise removal in a way that favors conforming to a text prompt, diffusion models can create images that match the prompt with higher fidelity.

Diffusion models have recently become a prominent way to implement generative AI and show signs of taking over from earlier methods such as generative adversarial networks (GANs) and transformers in conditional image synthesis. They can produce state-of-the-art images while preserving quality and the semantic structure of the data, and they are not affected by training drawbacks such as mode collapse.

A new way of AI-based synthesis

One of the recent breakthroughs in computer vision and ML was the invention of GANs, which are two-part AI models consisting of a generator that creates samples and a discriminator that attempts to differentiate between the generated samples and real-world samples. This method became a stepping stone for a new field known as generative modeling. However, after going through a boom phase, GANs started to plateau, as most methods struggled to overcome the bottlenecks of adversarial training, in which the two networks are trained against each other on as many examples as possible.

GANs work well for multiple applications, but they are difficult to train, and their output lacks diversity. For example, GANs often suffer from unstable training and from mode collapse, an issue where the generator may learn to produce only one output that seems most plausible, while autoregressive models typically suffer from slow synthesis speed.

To address such shortcomings, diffusion models build on probabilistic likelihood estimation, a method of estimating the parameters of a statistical model from observed data by finding the parameter values that maximize the likelihood of those observations.
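To make the idea of likelihood estimation concrete, here is a minimal sketch in Python. It is an illustration only, not how any particular diffusion model is trained. It assumes the data come from a one-dimensional Gaussian, for which the maximum likelihood parameters have a closed form. The function names and the toy values (a true mean of 2.0 and a standard deviation of 0.5) are hypothetical.

```python
import math
import random


def gaussian_log_likelihood(data, mean, std):
    """Log-likelihood of the observations under a normal distribution N(mean, std^2)."""
    n = len(data)
    squared_error = sum((x - mean) ** 2 for x in data)
    return -n * math.log(std * math.sqrt(2 * math.pi)) - squared_error / (2 * std ** 2)


def fit_gaussian_mle(data):
    """Closed-form maximum likelihood estimates of a Gaussian's mean and standard deviation."""
    n = len(data)
    mean = sum(data) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in data) / n)
    return mean, std


# Toy observations (assumed for illustration): 5,000 draws from N(2.0, 0.5^2).
rng = random.Random(0)
observations = [rng.gauss(2.0, 0.5) for _ in range(5000)]

mle_mean, mle_std = fit_gaussian_mle(observations)

# Moving the mean away from the estimate can only lower the likelihood.
best = gaussian_log_likelihood(observations, mle_mean, mle_std)
shifted = gaussian_log_likelihood(observations, mle_mean + 0.1, mle_std)
print(mle_mean, mle_std, best > shifted)
```

Real generative models have far too many parameters for a closed-form solution, so in practice the likelihood (or a bound on it) is typically optimized with gradient-based training instead.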

Diffusion models are generative models (a type of AI model that learns the distribution of its training data). Once trained, these models can generate new data samples similar to those they were trained on. This generative nature has led to their rapid adoption for use cases such as image and video generation, text generation and synthetic data generation, to name a few.

Diffusion models work by deconstructing training data through the successive addition of Gaussian noise, and then learning to recover the data by reversing this noising process. After training, the model can generate data by simply passing randomly sampled noise through the learned de-noising process. This synthesis procedure can be interpreted as an optimization algorithm that follows the gradient of the data density to produce likely samples.
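The noising and de-noising loop described above can be sketched in a few lines of Python. This is a toy illustration under loud assumptions, not the implementation of any named model. The "training data" is a one-dimensional Gaussian, the noise schedule is an assumed linear one, and the trained neural network is replaced by predict_noise, an exact formula that exists only because the toy data is Gaussian. All names and constants are hypothetical.

```python
import math
import random

NUM_STEPS = 500
# Assumed linear noise schedule: the variance of the noise added at each step.
BETAS = [1e-4 + (0.02 - 1e-4) * t / (NUM_STEPS - 1) for t in range(NUM_STEPS)]
ALPHAS = [1.0 - beta for beta in BETAS]
ALPHA_BARS = []  # cumulative products: how much of the original signal survives
running = 1.0
for alpha in ALPHAS:
    running *= alpha
    ALPHA_BARS.append(running)

DATA_MEAN, DATA_STD = 2.0, 0.5  # toy "training data": a 1-D Gaussian


def add_noise(x0, t, rng):
    """Forward process: jump to step t by mixing the data point with Gaussian noise."""
    ab = ALPHA_BARS[t]
    return math.sqrt(ab) * x0 + math.sqrt(1.0 - ab) * rng.gauss(0.0, 1.0)


def predict_noise(x_t, t):
    """Stand-in for the trained network: the exact expected noise for Gaussian data.

    It is proportional to the negative gradient of the log density of noisy data.
    """
    ab = ALPHA_BARS[t]
    marginal_var = ab * DATA_STD ** 2 + (1.0 - ab)
    return math.sqrt(1.0 - ab) * (x_t - math.sqrt(ab) * DATA_MEAN) / marginal_var


def sample(rng):
    """Reverse process: start from pure noise and remove the predicted noise step by step."""
    x = rng.gauss(0.0, 1.0)
    for t in reversed(range(NUM_STEPS)):
        eps = predict_noise(x, t)
        x = (x - BETAS[t] / math.sqrt(1.0 - ALPHA_BARS[t]) * eps) / math.sqrt(ALPHAS[t])
        if t > 0:  # no fresh noise is injected on the final step
            x += math.sqrt(BETAS[t]) * rng.gauss(0.0, 1.0)
    return x


def mean_and_std(values):
    mean = sum(values) / len(values)
    return mean, math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))


rng = random.Random(0)
noised = [add_noise(rng.gauss(DATA_MEAN, DATA_STD), NUM_STEPS - 1, rng) for _ in range(2000)]
generated = [sample(rng) for _ in range(2000)]

noised_mean, noised_std = mean_and_std(noised)           # close to standard noise
generated_mean, generated_std = mean_and_std(generated)  # close to the data statistics
print(noised_mean, noised_std, generated_mean, generated_std)
```

In a real system, predict_noise would be a neural network trained to recover the noise that add_noise mixed in, and the data would be images rather than single numbers.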

[Figure: The 3D diffusion and reconstruction from text process architecture. Image source: DreamFus]

“Diffusion models help address the drawbacks of GAN by handling noise better and producing a much higher diversity of images with similar or higher quality images while requiring low effort in training,” said Swapnil Srivastava, VP and global head of data and analytics at Evalueserve. “As diverse synthetic data is a primary need for all data science architectures, diffusion models are better at addressing the problems and allowing for the scale required for developing advanced AI projects."

Beyond higher image quality, diffusion models offer other benefits, notably that they do not require adversarial training. Other well-known methods, such as transformers, require massive amounts of data and have plateaued in performance in vision domains compared with diffusion models.

Current market adoption of diffusion models

Diffusion models give generative AI several distinctive capabilities, including creating diverse images and rendering text in different artistic styles, 3D understanding and animation.

Progressing from plain image synthesis, the capabilities of these next-gen models are moving toward video and 3D generation. The recently released Imagen Video by Google and Make-A-Video by Meta are prime examples of the high-level capabilities of generative AI.

Imagen Video consists of a text encoder (frozen T5-XXL), a base video diffusion model, and interleaved spatial and temporal super-resolution diffusion models. Similarly, Make-A-Video’s video diffusion model (VDM) uses a space-time factorized U-Net with joint image and video data training. In addition, the VDM was trained on 10 million text-video pairs, which made it easier for the model to produce videos from the provided text.

Saam Motamedi, general partner at Silicon Valley venture capital firm Greylock, says that market adoption of generative AI techniques such as diffusion models has accelerated sharply because they make it easier for developers to build on top of existing models and bring advanced capabilities into their applications.

“Diffusion models’ ability to produce stable and state-of-the-art results signal[s] the next generative AI evolution,” Motamedi told VentureBeat. “These advances in different generative techniques around all data modalities such as text, image, video, audio and multi-modal data, will birth new and impactful use cases.”

Srivastava said that generative AI powered by diffusion models can reduce time and effort during industrial or robotic product development, increase creativity and reusability in marketing, allow content creators to create new-generation content or NFTs, and be used for diagnosis and antibody testing.

“The possibility and future for text-to-video would be multifold, with applicability across immersive experiences in the metaverse to its applicability in video production and creative media," he said. "In the social media space, we anticipate seeing a new way content creators use such technology at scale to drive engagement and thereby the adoption of such technology."

The AI research team at IBM recently integrated diffusion models as one of its techniques, using them in areas like chemistry, materials design and discovery. IBM’s Generative Toolkit for Scientific Discovery (GT4SD) is an open-source library that uses generative models to generate new molecule designs based on properties like target proteins, target omics (i.e. genomics, transcriptomics, proteomics or metabolomics) profiles, scaffold distances, binding energies and additional targets relevant to materials and drug discovery.

GT4SD includes a wide range of generative models and training methods, including variational autoencoders, sequence-to-sequence models and diffusion models. Its objective is to provide and connect state-of-the-art generative models and methods for different scientific discovery challenges.

John Smith, an IBM fellow in discovery technology foundations, accelerated discovery, said that designing and discovering new chemicals is a huge challenge due to the practically infinite search spaces, and generative models are one of the most promising approaches for addressing this difficulty.

“Generative models provide a way to use AI to creatively propose novel chemical entities and formulations that target desired properties,” Smith told VentureBeat. “We hope that by seeding this open-source effort on GT4SD, we can help the scientific and technical communities more easily employ generative models for applications including the discovery of materials for climate and sustainability, design of new therapeutics and biomarkers, discovery [of] materials for next-generation computing devices, and more.”

Future opportunities and challenges for diffusion models

According to William Falcon, cofounder and CEO of Lightning AI, diffusion models will play an essential role in generative AI evolution as they have no appreciable disadvantages compared to previous architectures, the sole exception being that their generation is iterative and requires additional processing power.

“One area [where] I expect to see diffusion play a large role is in the buildout of VR and AR games and products,” he said. “We are already starting to see the community experimenting with diffusion-powered immersive environments and the generation of assets from individual shots. Asset generation has always been a big blocker in making virtual worlds thrive, and diffusion has the power to change everything there as well.”

Falcon said that although diffusion models unleash an entirely new dimension of creativity for people to express themselves, safety is and will continue to be a big theme.

“The standard safety filters are extremely bare-bones, and the datasets being used to train such models still sport a concerning amount of unsafe and biased material. Another methodological challenge is composition. In other words, controlling how different concepts are used together, either blended in the same subject or as distinct subjects side-by-side in the same creation,” he said.

Likewise, Fernando Lucini, global lead for data science & machine learning engineering at Accenture, said that the quality of generated images remains a challenge for the near future.

“I view this to be a problem between the combination of fidelity, meaning that a generated image looks reasonable to most people, and the perception of fidelity, which recognizes [that] what quality means to one person can differ from what it means to another,” Lucini told VentureBeat. “We want images with high fidelity, especially if we’ve asked the model to produce a specific artistic style or a realistic item.”

Lucini believes that the future of these models is in generating imagery and video from plain text, which can play a role in evolving substantial generative machines that we can interact with more frequently.

“What we see in our daily lives, or what we can capture on a camera, can differ from an image generation, meaning that we have to contend with the fact that an image generation might have low fidelity and produce unwanted distortion, and that can take time to correct,” he said.
