OpenAI’s text-to-image engine, DALL-E, is a powerful visual idea generator

Once upon a time in Silicon Valley, engineers at the various electronics firms would tinker at their benches and create new inventions. They did this tinkering, at least in part, to show the engineer at the next bench, so that both could appreciate the ingenuity and inspire others. Some of this work eventually made it into products -- but much of it did not. This inefficient approach lasted until the late 1980s, when it was largely supplanted (first by the bean counters, then by marketing staffs) and product development shifted to focus on perceived customer desires.

News from OpenAI last week about DALL-E – an advanced artificial intelligence neural network that generates images from text prompts – is reminiscent of those earlier times. The OpenAI team acknowledged in their blog post that there is not a defined application they had in mind, and that there is the potential for unknown societal impacts and ethical challenges from the technology. But what is known is that, like those earlier inventions, DALL-E is something of a marvel concocted by the engineering team.

OpenAI chose the name DALL-E as a hat tip to the artist Salvador Dalí and Pixar’s WALL-E. It produces pastiche images that reflect Dalí’s surrealism, which merges dream and fantasy with the everyday rational world, as well as inspiration from NASA paintings from the 1950s and 1960s and from those created for Disneyland’s Tomorrowland by Disney Imagineers.

That DALL-E is a synthesis of surrealism and animation should not come as a surprise, as it has been done before. Dalí and Walt Disney collaborated on a short animation beginning in 1946, though it took more than 50 years before it was released. Named “Destino,” the film melded the styles of two legendary imaginative minds.

DALL-E is a 12-billion-parameter version of the 175-billion-parameter GPT-3 natural language processing neural network. GPT-3 “learns” based on patterns it discovers in data gleaned from the internet, from Reddit posts to Wikipedia to fan fiction and other sources. Based on that learning, GPT-3 is capable of many different tasks with no additional training: it can produce compelling narratives, generate computer code, translate between languages, and perform math calculations, among other feats, including autocompleting images.

With DALL-E, OpenAI has refined GPT-3 to focus on and extend the manipulation of visual concepts through language. It is trained to generate images from text descriptions using a dataset of text-image pairs. Both GPT-3 and DALL-E are “transformers,” an easy-to-parallelize type of neural network that can be scaled up and trained on huge datasets. DALL-E is not the first text-to-image network, as this synthesis has been an active area of research since 2016.

The OpenAI blog announcing DALL-E claims that, via natural language, it provides access to a subset of the capabilities of a 3D rendering engine -- software that uses features of graphics cards to generate images displayed on screens or printed on a page. Architects use rendering engines to visualize buildings. Archeologists use them to recreate ancient structures. Advertisers and graphic designers use them to create more striking results. They are also used in video games, digital art, education, and medicine to offer more immersive experiences. The company further states that unlike a 3D rendering engine, whose inputs must be specified unambiguously and in complete detail, DALL-E is often able to “fill in the blanks” when the text prompt implies that the image must contain a certain detail that is not explicitly stated.

For example, DALL-E can combine disparate ideas to synthesize objects, some of which are unlikely to exist in the real world, such as this incongruous example merging a snail and a harp.

It is that “filling in the blanks” that is particularly interesting, as it suggests emergent capabilities -- unexpected phenomena that arise from complex systems. Human consciousness is the classic example of emergence, a property of the brain that arises from the communication of information across all its regions. In this way, DALL-E is the next step in OpenAI’s mission to develop general artificial intelligence that benefits humanity.

How might DALL-E benefit humanity?

The company’s blog specifically mentions design as a possible use case. For example, a text prompt of “An armchair in the shape of an avocado. An armchair imitating an avocado,” yields the following images:

The text prompt "A female mannequin dressed in a black leather jacket and gold pleated skirt" yields the following.

And the text prompt "A loft bedroom with a white bed next to a nightstand. There is a fish tank standing next to the bed" yields the following:

In each of the examples above, DALL-E shows creativity, producing useful conceptual images for product, fashion, and interior design. I’ve shown only a subset of the images produced for each of the prompts, but they are the ones that most closely match the request. And they clearly show that DALL-E could support creative brainstorming or augment human designers, either by supplying thought starters or, one day, by producing final conceptual images. Time will tell whether this will replace people performing these tasks or simply be another tool to boost efficiency and creativity.

A mental health aid

In response to another DALL-E demo, shown below, where the text prompt asks for “an illustration of a baby daikon radish in a tutu walking a dog,” a recent entry in “The Good Stuff” newsletter starts: “A baby daikon radish in a tutu walking a dog. The phrase makes me smile. The thought of it makes me smile. And the illustrations conjured by a new artificial intelligence model may be the only things single-handedly propping up my mental health.”

The newsletter writer could be onto something significant. The relationship between creating art and positive mental health is well known. It has spawned the field of art therapy, and visualization has long been a mainstay of psychotherapy. Art therapy professor Girija Kaimal notes: “Anything that engages your creative mind — the ability to make connections between unrelated things and imagine new ways to communicate — is good for you.” This is true for any creative expression: drawing, painting, photography, collaging, writing poetry, etc. This could extend to interacting with DALL-E, whether to create something new, simply to get a smile, or, perhaps more significantly from a therapeutic perspective, to give immediate visual representation to a feeling expressed in words.

Synthetic video on demand

As DALL-E already provides some 3D rendering engine capabilities via natural language input, it could be possible for the system to quickly produce storyboards. Conceivably, it could produce entirely synthetic videos based on a sequence of text statements. At its best, this might lead to greater efficiency in producing animations.

The creation of DALL-E harkens back to the time when engineers created without a clear signal from marketing to build a product. Discussing a fusion of language and vision, OpenAI Chief Scientist Ilya Sutskever believes the ability to process text and images together should make AI models smarter. If you can expose models to data in the same way it is absorbed by humans, the models should learn concepts in a way that is more similar to humans and that is more useful to a greater number of people. DALL-E is a considerable step forward in that direction.

Gary Grossman is the Senior VP of Technology Practice at Edelman and Global Lead of the Edelman AI Center of Excellence.
