Nvidia enters the text-to-image battle with eDiff-I, takes on DALL-E, Imagen

The domain of artificial intelligence (AI) text-to-image generators is the new battleground for tech conglomerates. Every AI-focused organization now aims to create a generative model that can showcase extraordinary detail and summon up mesmerizing images from relatively simple text prompts. After OpenAI’s DALL-E 2, Google’s Imagen and Meta’s Make-a-Scene made headlines with their image synthesis capabilities, Nvidia has entered the race with its text-to-image model called eDiff-I.

Unlike other major generative text-to-image models, which use a single denoiser network throughout the iterative denoising process, Nvidia’s eDiff-I uses an ensemble of expert denoisers, each specialized in a different interval of the generative process.

Nvidia’s unique image synthesis algorithm

The developers behind eDiff-I describe the text-to-image model as “a new generation of generative AI content creation tool that offers unprecedented text-to-image synthesis with instant style transfer and intuitive painting-with-words capabilities.”

In a recently published paper, the authors say that current image synthesis algorithms rely heavily on the text prompt early in the sampling process to create text-aligned content, whereas later in the process text conditioning is almost wholly disregarded and the task shifts to producing outputs of high visual fidelity. This led to the realization that there might be better ways to represent these distinct modes of the generation process than sharing model parameters across the whole process.

“Therefore, in contrast to existing works, we propose to train an ensemble of text-to-image diffusion models specialized for different synthesis stages,” said Nvidia’s research team in their paper. “To maintain training efficiency, we initially train a single model, which is then progressively split into specialized models that are further trained for the specific stages of the iterative generation process.”

eDiff-I’s image synthesis pipeline comprises a combination of three diffusion models — a base model that can synthesize samples of 64 x 64 resolution, and two super-resolution stacks that can upsample the images progressively to 256 x 256 and 1024 x 1024 resolution, respectively.
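To make the cascade concrete, the short sketch below works out how much each stage upsamples. Only the three resolutions come from the description above; the stage labels and helper names are hypothetical, and no actual diffusion model is involved.

```python
# Resolutions named in the article; the stage labels are hypothetical.
STAGES = [("base", 64), ("super_res_1", 256), ("super_res_2", 1024)]


def upsampling_factors(stages):
    """Per-side upsampling factor of each stage relative to the previous one."""
    return [
        (name, res // prev_res)
        for (_, prev_res), (name, res) in zip(stages, stages[1:])
    ]


def pixel_ratio(low_res, high_res):
    """How many times more pixels a square high_res image has than a low_res one."""
    return (high_res * high_res) // (low_res * low_res)


for name, factor in upsampling_factors(STAGES):
    print(f"{name}: {factor}x per side")
print("pixels, 64 -> 1024:", pixel_ratio(64, 1024))
```

Each super-resolution stage enlarges the image 4x per side, so the final 1024 x 1024 output has 256 times as many pixels as the 64 x 64 base sample.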

These models process an input caption by first computing its T5 XXL embedding and CLIP text embedding. The model architecture for eDiff-I can also use CLIP image encodings computed from a reference image. These image embeddings serve as a style vector and are fed into the cascaded diffusion models, which progressively generate images up to a resolution of 1024 x 1024.

These unique aspects give eDiff-I a far greater level of control over the generated content. In addition to synthesizing images from text, the eDiff-I model has two additional features: style transfer, which lets the user control the style of the generated image using a reference image, and “paint with words,” in which the user creates images by drawing segmentation maps on a virtual canvas. The latter is handy when the user has a specific image in mind.

_{Image source: Nvidia AI.}

A new denoising process

Synthesis in diffusion models generally occurs through a series of iterative denoising steps that gradually generate an image from random noise, with the same denoiser neural network used throughout the entire process. The eDiff-I model instead trains an ensemble of denoisers, each specialized for denoising at a different interval of the generative process. Nvidia refers to these networks as “expert denoisers” and claims this approach drastically improves image-generation quality.

_{The denoising architecture used by eDiff-I. Image source: Nvidia AI.}
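The routing idea behind expert denoisers can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the class and function names, the interval boundaries, the 1,000-step schedule and the trivial "denoisers" are hypothetical and are not taken from Nvidia's implementation. The sketch only shows how a sampling loop might hand each timestep to the expert responsible for that interval.

```python
from bisect import bisect_right


class ExpertEnsemble:
    """Routes each timestep to the denoiser responsible for its interval."""

    def __init__(self, boundaries, experts):
        # boundaries: sorted interior cut points, so N experts need N - 1 cuts.
        if len(experts) != len(boundaries) + 1:
            raise ValueError("need exactly one more expert than boundaries")
        if list(boundaries) != sorted(boundaries):
            raise ValueError("boundaries must be sorted")
        self.boundaries = list(boundaries)
        self.experts = list(experts)

    def expert_index(self, t):
        # Expert i covers boundaries[i - 1] <= t < boundaries[i].
        return bisect_right(self.boundaries, t)

    def denoise(self, x, t):
        return self.experts[self.expert_index(t)](x, t)


def make_toy_expert(scale):
    """Stand-in for a neural denoiser: just shrinks every value by `scale`."""
    return lambda x, t: [scale * v for v in x]


def sample(ensemble, x, num_steps):
    """Run from the noisiest step down to step 0, recording the expert used."""
    used = []
    for t in range(num_steps - 1, -1, -1):
        used.append(ensemble.expert_index(t))
        x = ensemble.denoise(x, t)
    return x, used


# Hypothetical split: low-noise, mid-noise and high-noise experts.
ensemble = ExpertEnsemble(
    boundaries=[300, 700],
    experts=[make_toy_expert(0.999), make_toy_expert(0.99), make_toy_expert(0.9)],
)
result, used = sample(ensemble, [1.0, -1.0], num_steps=1000)
```

Because only one expert runs at any given step, the per-step cost at inference matches that of a single denoiser, which is consistent with Stephenson's remark below that the approach adds complexity to training rather than to production use.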

Scott Stephenson, CEO at Deepgram, says that the new methods presented in eDiff-I’s training pipeline could be incorporated into new versions of DALL-E or Stable Diffusion, where they could enable significant advances in quality and control over the synthesized images.

“It definitely adds to the complexity of training the model, but doesn’t significantly increase computational complexity in production use,” Stephenson told VentureBeat. “Being able to segment and define what each component of the resulting image should look like could accelerate the creation process in a meaningful way. In addition, it allows the human and the machine to work more closely together.”

Better than contemporaries?

While other state-of-the-art contemporaries such as DALL-E 2 and Imagen use only a single text encoder, such as CLIP or T5, eDiff-I’s architecture uses both encoders in the same model. Such an architecture enables eDiff-I to generate substantially diverse visuals from the same text input.

CLIP text embeddings give the created image a stylized look; however, the output frequently misses information from the text. Images created using T5 text embeddings, on the other hand, tend to contain better individual objects. By combining the two, eDiff-I produces images with both qualities.

_{Generating variations from the same text input. Image source: Nvidia AI.}

The development team also found that the more descriptive the text prompt, the more T5 outperforms CLIP, and that combining the two results in better synthesis outputs. The model was also evaluated on standard datasets such as MS-COCO, where CLIP+T5 embeddings provided significantly better trade-off curves than either alone.

Nvidia’s study shows that eDiff-I outperformed competitors like DALL-E 2, Make-a-Scene, GLIDE and Stable Diffusion based on the Fréchet Inception Distance, or FID, a metric used to evaluate the quality of AI-generated images, where a lower score is better. eDiff-I also achieved a lower FID score than Google’s Imagen and Parti.

_{Zero-shot FID comparison with recent state-of-the-art models on the COCO 2014 validation dataset. Image source: Nvidia AI.}
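FID measures the distance between Gaussian fits to the feature statistics of real and generated images, so a smaller value means the two sets look more alike. Real evaluations use high-dimensional features from an Inception network and full covariance matrices. The sketch below is a deliberately simplified one-dimensional version, assumed here only to keep the example dependency-free; the function name and the toy numbers are hypothetical.

```python
from math import sqrt
from statistics import fmean, variance


def frechet_distance_1d(real, generated):
    """Frechet distance between 1-D Gaussians fitted to two lists of numbers.

    One-dimensional special case of the FID formula:
    (mu_r - mu_g)^2 + var_r + var_g - 2 * sqrt(var_r * var_g)
    """
    mu_r, mu_g = fmean(real), fmean(generated)
    var_r, var_g = variance(real), variance(generated)  # sample variances
    return (mu_r - mu_g) ** 2 + var_r + var_g - 2.0 * sqrt(var_r * var_g)


# Toy "features": the close set is a small shift of the real one,
# the far set is a large shift.
real_features = [0.0, 1.0, 2.0, 3.0]
close_features = [0.5, 1.5, 2.5, 3.5]
far_features = [2.0, 3.0, 4.0, 5.0]

close_score = frechet_distance_1d(real_features, close_features)
far_score = frechet_distance_1d(real_features, far_features)
```

In this toy setup the set that sits nearer to the real features receives the smaller score, which is why a lower FID is read as better image quality.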

When comparing images generated from both simple and long, detailed captions, Nvidia’s study claims that both DALL-E 2 and Stable Diffusion failed to synthesize images that accurately matched the text captions. In addition, the study found that other generative models either produce misspellings or ignore some of the attributes. Meanwhile, eDiff-I could correctly model characteristics from English text on a wide range of samples.

But with that said, the research team also noted that they generated multiple outputs from each method and cherry-picked the best one to include in the figure.

_{Comparison of image generation through detailed captions. Image source: Nvidia AI.}

Current challenges for generative AI

Modern text-to-image diffusion models have the potential to democratize artistic expression by offering users the capability to produce detailed and high-quality imagery without the need for specialized skills. However, they can also be used for advanced photo manipulation for malicious purposes or to create deceptive or harmful content.

The recent progress of generative models and AI-driven image editing has profound implications for image authenticity and beyond. Nvidia says such challenges can be tackled by automatically validating authentic images and detecting manipulated or fake content.

The training datasets of current large-scale text-to-image generative models are mostly unfiltered and can include biases that are captured by the model and reflected in the generated data. Therefore, it is crucial to be aware of such biases in the underlying data and counteract them by actively collecting more representative data or using bias correction methods.

“Generative AI image models face the same ethical challenges as other artificial intelligence fields: the provenance of training data and understanding how it’s used in the model,” said Stephenson. “Big labeled-image datasets can contain copyrighted material, and it’s often impossible to explain how (or if) copyrighted material was incorporated into the final product.”

According to Stephenson, model training speed is another challenge that generative AI models still face, especially during their development phase.

“If it takes a model between 3 and 60 seconds to generate an image on some of the highest-end GPUs on the market, production-scale deployments will either require a significant increase in GPU supply or figure out how to generate images in a fraction of the time. The status quo isn’t scalable if demand grows by 10x or 100x,” Stephenson told VentureBeat.

The future of generative AI

Kyran McDonnell, founder and CEO at reVolt, said that although today’s text-to-image models do abstract art exceptionally well, they lack the requisite architecture to construct the priors necessary to comprehend reality properly.

“They’ll be able to approximate reality with enough training data and better models, but won’t truly understand it,” he said. “Until that underlying problem is tackled, we’ll still see these models making commonsense errors.”

McDonnell believes that next-gen text-to-image architectures, such as eDiff-I, will resolve many of the current quality issues.

“We can still expect composition errors, but the quality will be similar to where specialized GANs are today regarding face generation,” said McDonnell.

Likewise, Stephenson said that we will see generative AI applied in a growing number of areas.

“Generative models trained on the style and general ‘vibe’ of a brand could generate an infinite variety of creative assets,” he said. “There’s plenty of room for enterprise applications, and generative AI hasn’t had its ‘mainstream moment’ yet.”