Two years after DALL-E debut, its inventor is "surprised" by impact

Before DALL-E 2, Stable Diffusion and Midjourney, there was just a research paper called “Zero-Shot Text-to-Image Generation.”

With that paper and a controlled website demo, on January 5, 2021, two years ago today, OpenAI introduced DALL-E, a neural network that “creates images from text captions for a wide range of concepts expressible in natural language.” (Also today: OpenAI is reportedly in talks for a "tender offer that would value it at $29 billion.")

A 12-billion-parameter version of the Transformer language model GPT-3 was trained to generate images from text descriptions, using a dataset of text–image pairs. VentureBeat reporter Khari Johnson described the name as “meant to evoke the artist Salvador Dali and the robot WALL-E” and included a DALL-E generated illustration of a “baby daikon radish in a tutu walking a dog.”

Image by DALL-E

Since then, things have moved fast, according to OpenAI researcher, DALL-E inventor and DALL-E 2 co-inventor Aditya Ramesh. That is more than a bit of an understatement, given the dizzying pace of development in the generative AI space over the past year. Most notable was the meteoric rise of diffusion models, which were a game-changer for DALL-E 2, released last April, and for its rivals: the open-source Stable Diffusion and the proprietary Midjourney.

“It doesn't feel like so long ago that we were first trying this research direction to see what could be done,” Ramesh told VentureBeat. “I knew that the technology was going to get to a point where it would be impactful to consumers and useful for many different applications, but I was still surprised by how quickly.”

Now, generative modeling is approaching the point where "there'll be some kind of iPhone-like moment for image generation and other modalities," he said. "I'm excited to be able to build something that will be used for all of these applications that will emerge."

Original research developed in conjunction with CLIP

The DALL-E 1 research was developed and announced in conjunction with CLIP (Contrastive Language-Image Pre-training), a separate model based on zero-shot learning that was essentially DALL-E's secret sauce. Trained on 400 million pairs of images with text captions scraped from the internet, CLIP could be instructed in natural language to perform classification benchmarks, and it was used to rank DALL-E's results.
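To make the ranking idea concrete, here is a minimal sketch of how candidate images might be ordered against a caption once both have been mapped to embedding vectors. It is an illustration only, not OpenAI's code: the function names and the tiny two-dimensional vectors are hypothetical stand-ins for real CLIP embeddings, and cosine similarity is assumed as the scoring function.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_candidates(text_embedding, image_embeddings):
    """Return candidate indices ordered from best to worst match."""
    scores = [cosine_similarity(text_embedding, e) for e in image_embeddings]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

# Hypothetical embeddings: one caption, three generated candidates.
text = [1.0, 0.0]
candidates = [[0.0, 1.0], [1.0, 0.1], [0.5, 0.5]]
ranking = rank_candidates(text, candidates)
print(ranking)
```

In this toy setup the second candidate points in nearly the same direction as the caption, so it is ranked first, and the candidate at a right angle to the caption comes last.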

Of course, there were plenty of early signs that text-to-image progress was coming.

“It has been clear for years that this future was coming fast,” said Jeff Clune, associate professor of computer science at the University of British Columbia. In 2016, when his team produced what he says were the first synthetic images that were hard to distinguish from real images, Clune recalled speaking to a journalist.

“I was saying that in a few years, you’ll be able to describe any image you want and AI will produce it, such as ‘Donald Trump taking a bribe from Putin with a smirk on his face,’” he said.

Generative AI has been a core segment of AI research since the beginning, said Nathan Benaich, general partner at Air Street Capital. “It’s worth pointing out that research like the development of generative adversarial networks (GANs) in 2014 and DeepMind’s WaveNet in 2016 were already starting to show how AI models could generate new images and audio from scratch, respectively,” he told VentureBeat in a message.

Still, the original DALL-E paper was “quite impressive at the time,” added futurist, author and AI researcher Matt White. “Although it was not the first work in the area of text-to-image synthesis, OpenAI’s approach of promoting their work to the general public and not just in AI research circles garnered them a lot of attention and rightfully so," he said.

Pushing DALL-E research as far as possible

From the start, Ramesh says his main interest was to push the research as far as possible.

“We felt like text-to-image generation was interesting because as humans, we’re able to construct a sentence to describe any situation that we might encounter in real life, but also fantastical situations or crazy scenarios that are impossible,” he said. “So we wanted to see if we trained a model to just generate images from text well enough, whether it could do the same things that humans can as far as extrapolation.”

One of the main research influences on the original DALL-E, he added, was VQ-VAE, a technique pioneered by Aaron van den Oord, a DeepMind researcher, to break up images into tokens that are like the tokens on which language models are trained.

“So we can take a transformer like GPT, that is just trained to predict each word after the next, and augment its language tokens with these additional image tokens,” he explained. “That lets us apply the same technology to generate images as well.”
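A toy sketch can show what it means to put image tokens in the same stream as text tokens. Everything here is an assumption for illustration: the three-entry codebook, the patch vectors, the vocabulary size and the function names are hypothetical, and a real VQ-VAE learns its codebook and encoder rather than using hand-picked values.

```python
def nearest_code(vector, codebook):
    """Index of the codebook entry closest to vector (squared Euclidean distance)."""
    def dist(entry):
        return sum((v - e) ** 2 for v, e in zip(vector, entry))
    return min(range(len(codebook)), key=lambda i: dist(codebook[i]))

def build_sequence(text_tokens, patch_vectors, codebook, text_vocab_size):
    """Concatenate text tokens with image tokens in one shared ID space."""
    image_tokens = [nearest_code(p, codebook) for p in patch_vectors]
    # Offset image IDs so they never collide with text IDs.
    return list(text_tokens) + [text_vocab_size + t for t in image_tokens]

# Hypothetical values: a 3-entry codebook and three encoded image patches.
codebook = [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
patches = [[0.9, 1.1], [0.1, -0.1], [0.2, 0.8]]
sequence = build_sequence([5, 7], patches, codebook, text_vocab_size=10)
print(sequence)
```

Once the image is expressed as discrete IDs like this, a next-token model can in principle be trained on the combined sequence exactly as it would be on text alone.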

People were surprised by DALL-E, he said, because “it's one thing to see an example of generalization in language models, but when you see it in image generation, it's just a lot more visceral and impactful.”

DALL-E 2’s move towards diffusion models

But by the time the original DALL-E research was published, Ramesh’s co-authors for DALL-E 2, Alex Nichol and Prafulla Dhariwal, were already working on using diffusion models in a modified version of GLIDE (a new OpenAI diffusion model).

This led to DALL-E 2 having quite a different architecture from the first iteration of DALL-E. As Vaclav Kosar explained, "DALL-E 1 uses discrete variational autoencoder (dVAE), next token prediction, and CLIP model re-ranking, while DALL-E 2 uses CLIP embedding directly, and decodes images via diffusion similar to GLIDE."

“It seemed quite natural [to combine diffusion models with DALL-E] because there are many advantages that come with diffusion models — inpainting being the most obvious feature that's kind of really clean and elegant to implement using diffusion,” said Ramesh.

Incorporating one particular technique, used while developing GLIDE, into DALL-E 2 — classifier-free guidance — led to a drastic improvement in caption-matching and realism, he explained.
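For readers curious about the mechanics, classifier-free guidance is commonly written as an extrapolation: the model makes one noise prediction with the caption and one without, and the final prediction is pushed away from the unconditional one. The sketch below shows only that combination step, under the assumption of the standard formulation; the function name, the guidance scale and the two-element vectors are hypothetical, and a real system applies this to full image-sized predictions at every denoising step.

```python
def guided_prediction(cond, uncond, scale):
    """Combine conditional and unconditional predictions.

    scale = 1 returns the conditional prediction unchanged;
    scale > 1 pushes further in the direction the caption suggests.
    """
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]

# Hypothetical predictions for a two-value "image".
cond = [0.5, -1.0]    # prediction given the caption
uncond = [0.25, 0.0]  # prediction with the caption dropped
guided = guided_prediction(cond, uncond, 3.0)
print(guided)
```

Larger scales generally trade sample diversity for closer agreement with the caption, which fits the improvement in caption-matching that Ramesh describes.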

“When Alex first tried it out, none of us were expecting such a drastic improvement in the results,” he said. “My initial expectation for DALL-E 2 was that it would just be an update over DALL-E, but it was surprising to me that we got it to the point where it's already starting to be useful for people.”

When the AI community and the general public first saw the image output of DALL-E 2 on April 6, 2022, the difference in image quality was, for many, jaw-dropping.

Image by DALL-E 2

"Competitive, exciting, and fraught”

DALL-E's release in January 2021 was the first in a wave of text-to-image research that builds from fundamental advances in language and image processing, including variational auto-encoders and autoregressive transformers, Margaret Mitchell, chief ethics scientist at Hugging Face, told VentureBeat by email. Then, when DALL-E 2 was released, “diffusion was a breakthrough that most of us working in the area did not see, and it really upped the game,” she said.

These past two years since the original DALL-E research paper have been “competitive, exciting, and fraught,” she added.

“The focus on how to model language and images came at the expense of how best to acquire data for the model,” she said, pointing out that individual rights and consent are “all but abandoned” in modern-day text-to-image advances. Current systems are “essentially stealing artists' concepts without providing any recourse for the artists,” she concluded.

The fact that DALL-E did not make its source code available also led others to develop open-source text-to-image options that made their own splashes by the summer of 2022.

The original DALL-E was “interesting but not accessible,” said Emad Mostaque, founder of Stability AI, which released the first iteration of the open-source text-to-image generator Stable Diffusion in August 2022, adding that “only the models my team trained were [open-source].” Mostaque added that “we started aggressively funding and supporting this space in summer of 2021.”

Going forward, DALL-E still has plenty of work to do, says White — even as it teases a new iteration coming soon.

“DALL-E 2 suffers from consistency, quality and ethical issues,” he said. It has issues with associations and composability, he pointed out, so a prompt like “a brown dog wearing a red shirt” can produce results where the attributes are transposed (e.g., a red dog wearing a brown shirt, a red dog wearing a red shirt, or different colors altogether). In addition, he added, DALL-E 2 still struggles with face and body composition, and with generating text in images consistently, “especially longer words.”

The future of DALL-E and generative AI

Ramesh hopes that more people learn how DALL-E 2’s technology works, which he thinks will lead to fewer misunderstandings.

“People think that the way the model works is that it sort of has a database of images somewhere, and the way it generates images is by cutting and pasting together pieces of these images to create something new,” he said. “But actually, the way it works is a lot closer to a human where, when the model is trained on the images, it learns an abstract representation of what all of these concepts are.”

The training data “isn't used anymore when we generate an image from scratch,” he explained. “Diffusion models start with a blurry approximation of what they're trying to generate, and then over many steps, progressively add details to it, like how an artist would start off with a rough sketch and then slowly flesh it out over time.”

And helping artists, he said, has always been a goal for DALL-E.

“We had aspirationally hoped that these models would be a kind of creative copilot for artists, similar to how Codex is like a copilot for programmers: another tool you can reach for to make many day-to-day tasks a lot easier and faster,” he said. “We found that some artists find it really useful for prototyping ideas. Whereas they would normally spend several hours or even several days exploring some concept before deciding to go with it, DALL-E could allow them to get to the same place in just a few hours or a few minutes.”

Over time, Ramesh said he hopes that more and more people get to learn and explore, both with DALL-E and with other generative AI tools.

“With [OpenAI's] ChatGPT, I think we’ve drastically expanded the outreach of what these AI tools can do and exposed a lot of people to using it,” he said. “I hope that over time people who want to do things with our technology can easily access it through our website and find ways to use it to build things that they’d like to see.”

[Updated by editor on 1/5/23 at 12:27 pm PT]