OpenAI debuts DALL-E for generating images from text

OpenAI today debuted two multimodal AI systems that combine computer vision and NLP: DALL-E, a system that generates images from text, and CLIP, a network trained on 400 million pairs of images and text.

The image above was generated by DALL-E from the text prompt "an illustration of a baby daikon radish in a tutu walking a dog." DALL-E uses a 12-billion-parameter version of GPT-3 and, like GPT-3, is a Transformer language model. The name is meant to evoke the artist Salvador Dalí and the robot WALL-E.

Tests OpenAI shared today appear to show that DALL-E can manipulate and rearrange objects in generated imagery and create things that don't exist, like a cube with the texture of a porcupine or a cube of clouds. Depending on the text prompt, images generated by DALL-E can look as if they were taken from the real world or can depict works of art. Visit the OpenAI website to try a controlled demo of DALL-E.

"We recognize that work involving generative models has the potential for significant, broad societal impacts. In the future, we plan to analyze how models like DALL-E relate to societal issues like economic impact on certain work processes and professions, the potential for bias in the model outputs, and the longer-term ethical challenges implied by this technology," OpenAI wrote today in a blog post about DALL-E.

OpenAI also introduced CLIP, a multimodal model trained on 400 million pairs of images and text collected from the internet. CLIP has zero-shot learning capabilities akin to those of the GPT-2 and GPT-3 language models.

"We find that CLIP, similar to the GPT family, learns to perform a wide set of tasks during pretraining, including optical character recognition (OCR), geo-localization, action recognition, and many others. We measure this by benchmarking the zero-shot transfer performance of CLIP on over 30 existing datasets and find it can be competitive with prior task-specific supervised models," 12 OpenAI coauthors write in a paper about the model.

Although testing found CLIP was proficient at a number of tasks, it fell short on specialized tasks, like satellite imagery classification or lymph node tumor detection.
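For readers curious what zero-shot classification looks like in practice, here is a minimal sketch in Python. It assumes the general approach behind models like CLIP: an image and a set of candidate text labels are each mapped to a vector, and the label whose vector is most similar to the image's is chosen. The vectors below are made-up toy numbers, not real CLIP outputs. The function names and the temperature value of 0.07 are illustrative assumptions, not OpenAI's code.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def zero_shot_classify(image_vec, label_vecs, temperature=0.07):
    """Score an image vector against text-label vectors (hypothetical helper).

    Returns a dict mapping each label to a softmax probability.
    The temperature is an assumed value that sharpens the distribution.
    """
    labels = list(label_vecs)
    logits = [cosine(image_vec, label_vecs[name]) / temperature for name in labels]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return {name: e / total for name, e in zip(labels, exps)}


# Toy embeddings, invented for illustration only.
image_vec = [0.9, 0.1, 0.2]
label_vecs = {
    "a photo of a dog": [0.8, 0.2, 0.1],
    "a photo of a cat": [0.1, 0.9, 0.3],
    "a satellite image": [0.2, 0.1, 0.9],
}

probs = zero_shot_classify(image_vec, label_vecs)
best = max(probs, key=probs.get)
print(best, round(probs[best], 3))
```

No label-specific training happens here: swapping in a different set of text labels changes the task, which is the sense in which the approach is "zero-shot."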

"This preliminary analysis is intended to illustrate some of the challenges that general purpose computer vision models pose and to give a glimpse into their biases and impacts. We hope that this work motivates future research on the characterization of the capabilities, shortcomings, and biases of such models, and we are excited to engage with the research community on such questions," the paper reads.

OpenAI chief scientist Ilya Sutskever was a coauthor of the paper detailing CLIP and may have alluded to the coming release of CLIP when he recently told deeplearning.ai that multimodal models would be a major machine learning trend in 2021. Google AI chief Jeff Dean made a similar prediction for 2020 in an interview with VentureBeat.

The release of DALL-E follows a number of generative models with the power to mimic or distort reality or predict how people paint landscapes and still lifes. But some, like StyleGAN, have also demonstrated racial bias.

OpenAI researchers working on CLIP and DALL-E called for additional research into the potential societal impact of both systems. GPT-3 displayed significant anti-Muslim bias and negative sentiment scores for Black people, so the same shortcomings could be embedded into DALL-E. A bias test included in the CLIP paper found that the model was most likely to miscategorize people under 20 as criminals or non-human. People classified as men were more likely to be labeled as criminals than people classified as women, and some label data contained in the dataset is heavily gendered.

Details on how OpenAI built DALL-E will be shared in an upcoming paper. Large language models trained on data scraped from the internet have been criticized by researchers who say the AI industry needs to undergo a culture change.
