How DALL-E 2 could solve major computer vision challenges

OpenAI has recently released DALL-E 2, a more advanced version of DALL-E, an ingenious multimodal AI capable of generating images purely based on text descriptions. DALL-E 2 does that by employing advanced deep learning techniques that improve the quality and resolution of the generated images, and it provides further capabilities such as editing an existing image or creating new versions of it.

Many AI enthusiasts and researchers tweeted about how amazing DALL-E 2 is at generating art and images from just a few words, yet in this article I’d like to explore a different application for this powerful text-to-image model -- generating datasets to solve computer vision’s biggest challenges.

Caption: A DALL-E 2 generated image. “A rabbit detective sitting on a park bench and reading a newspaper in a Victorian setting.” Source: Twitter

Computer vision’s shortcomings

Computer vision AI applications can vary from detecting benign tumors in CT scans to enabling self-driving cars. Yet what is common to all is the need for abundant data. One of the most prominent performance predictors of a deep learning algorithm is the size of the underlying dataset it was trained on. For example, the JFT dataset, which is an internal Google dataset used for the training of image classification models, consists of 300 million images and more than 375 million labels.

Consider how an image classification model works: A neural network transforms pixel colors into a set of numbers that represent its features, also known as the “embedding” of an input. Those features are then mapped to the output layer, which contains a probability score for each class of images the model is supposed to detect. During training, the neural network tries to learn the best feature representations that discriminate between the classes, e.g. a pointy ear feature for a Dobermann vs. a Poodle.
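To make the embedding-to-probability step concrete, here is a minimal sketch of the final layer only. The two-feature embedding, the weights, and the function names are made up for illustration; a real network learns these values during training.

```python
import math

def softmax(logits):
    """Turn raw class scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def classify(embedding, weights, biases, class_names):
    """Map an embedding to one probability per class via a linear output layer."""
    logits = [
        sum(w * f for w, f in zip(row, embedding)) + b
        for row, b in zip(weights, biases)
    ]
    return dict(zip(class_names, softmax(logits)))

# Toy embedding with two hand-picked features: [pointy_ears, curly_fur]
embedding = [0.9, 0.1]
weights = [
    [2.0, -1.0],  # Dobermann row rewards pointy ears
    [-1.0, 2.0],  # Poodle row rewards curly fur
]
biases = [0.0, 0.0]

probs = classify(embedding, weights, biases, ["Dobermann", "Poodle"])
print(probs)
```

With a strong pointy-ear feature, the Dobermann class receives the higher probability, which is the kind of discriminating feature the paragraph above describes.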

Ideally, the machine learning model would learn to generalize across different lighting conditions, angles, and background environments. Yet more often than not, deep learning models learn the wrong representations. For example, a neural network might deduce that blue pixels are a feature of the “frisbee” class because all the images of a frisbee it has seen during training were on the beach.

One promising way of solving such shortcomings is to increase the size of the training set, e.g. by adding more pictures of frisbees with different backgrounds. Yet this exercise can prove to be a costly and lengthy endeavor.

First, you would need to collect all the required samples, e.g. by searching online or by capturing new images. Then, you would need to ensure each class has enough samples so that the model does not overfit to some classes or underfit to others. Lastly, you would need to label each image, stating which image corresponds to which class. In a world where more data translates into a better-performing model, these three steps act as a bottleneck for achieving state-of-the-art performance.
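The class-balance step can be checked with a few lines of code. The sketch below flags classes that have far fewer samples than the largest class; the `min_share` cutoff of 0.5 and the sample counts are arbitrary assumptions, not a recommended setting.

```python
from collections import Counter

def underrepresented_classes(labels, min_share=0.5):
    """Return classes whose sample count is below min_share of the largest class."""
    counts = Counter(labels)
    if not counts:
        return {}
    largest = max(counts.values())
    return {name: n for name, n in counts.items() if n < min_share * largest}

# Made-up label list for a dog breed dataset
labels = ["poodle"] * 120 + ["dobermann"] * 100 + ["dalmatian"] * 15

gaps = underrepresented_classes(labels)
print(gaps)
```

Classes returned by this check are the ones that would benefit most from additional samples.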

But even then, computer vision models are easily fooled, especially if they are being attacked with adversarial examples. Guess what another way to mitigate adversarial attacks is? You guessed right -- more labeled, well-curated, and diverse data.

Caption: OpenAI’s CLIP wrongly classified an apple as an iPod due to a textual label. Source: OpenAI

Enter DALL-E 2

Let’s take an example of a dog breed classifier and a class for which it is a bit harder to find images -- Dalmatian dogs. Can we use DALL-E to solve our lack-of-data problem?

Consider applying the following techniques, all powered by DALL-E 2:

Besides generating more training data, the huge benefit of all of the above techniques is that the newly generated images are already labeled, removing the need for a human labeling workforce.

While image-generating techniques such as generative adversarial networks (GANs) have been around for quite some time, DALL-E 2 differentiates itself through its 1024x1024 high-resolution generations, its multimodal nature of turning text into images, and its strong semantic consistency, i.e. understanding the relationship between different objects in a given image.

Automating dataset creation using GPT-3 + DALL-E

DALL-E’s input is a textual prompt of the image we wish to generate. We can leverage GPT-3, a text generating model, to generate dozens of textual prompts per class that will then be fed into DALL-E, which in turn will create dozens of images that will be stored per class.
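The flow described above can be sketched as a small loop. Since the article does not show the GPT-3 or DALL-E calls, both are passed in as plain functions here; `make_prompts`, `make_image`, and the two stand-ins below are hypothetical placeholders, not real client code.

```python
def build_dataset(class_names, make_prompts, make_image):
    """Generate prompts per class, then one image per prompt, stored under the class name."""
    dataset = {}
    for name in class_names:
        # The class name doubles as the label, so no manual labeling is needed.
        dataset[name] = [(prompt, make_image(prompt)) for prompt in make_prompts(name)]
    return dataset

# Stand-in for a text generating model (assumption: returns a list of prompts)
def fake_prompts(name):
    return [f"A {name} on a beach", f"A {name} in the snow"]

# Stand-in for an image generating model (assumption: returns an image handle)
def fake_image(prompt):
    return f"<image for: {prompt}>"

dataset = build_dataset(["Dalmatian", "Poodle"], fake_prompts, fake_image)
print(dataset["Dalmatian"])
```

Passing the two models in as functions keeps the loop independent of whichever API ends up serving them.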

For example, we could generate prompts that include different environments for which we would like DALL-E to create images of dogs.

Caption: A GPT-3 generated prompt to be used as input to DALL-E. Source: author

Using this example, and a template-like sentence such as “A [class_name] [gpt3_generated_actions],” we could feed DALL-E with the following prompt: “A Dalmatian laying down on the floor.” This can be further optimized by fine-tuning GPT-3 to produce dataset captions such as the one in the OpenAI Playground example above.
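Filling the template is plain string formatting. In the sketch below, the `gpt3_output` text is a hand-written stand-in for what GPT-3 might return (one action per line), and `build_prompts` is an illustrative helper name.

```python
def build_prompts(class_name, actions):
    """Fill the 'A [class_name] [action]' template once per non-empty action."""
    prompts = []
    for action in actions:
        cleaned = action.strip().rstrip(".")
        if cleaned:  # skip blank lines in the generated text
            prompts.append(f"A {class_name} {cleaned}.")
    return prompts

# Hand-written stand-in for GPT-3 output: one action or setting per line
gpt3_output = """laying down on the floor
running on a beach at sunset

sitting in the snow."""

prompts = build_prompts("Dalmatian", gpt3_output.splitlines())
for prompt in prompts:
    print(prompt)
```

The same action list can be reused for every class in the dataset by changing only `class_name`.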

To further increase confidence in the newly added samples, one can set a certainty threshold and keep only the generations that rank above it, since every generated image is ranked by CLIP, a model that scores how well an image matches a piece of text.
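A threshold filter of this kind could look like the sketch below. It assumes each generation already carries a CLIP score where higher means a better match with its prompt; the `Generation` record, the file names, the scores, and the 0.28 cutoff are all invented for illustration, and no CLIP model is called here.

```python
from dataclasses import dataclass

@dataclass
class Generation:
    image_path: str
    prompt: str
    clip_score: float  # assumed: higher means a better image-text match

def keep_confident(generations, threshold):
    """Keep only generations whose score meets or exceeds the threshold."""
    return [g for g in generations if g.clip_score >= threshold]

# Invented scores for three generated images
generations = [
    Generation("dalmatian_01.png", "A Dalmatian on a beach", 0.31),
    Generation("dalmatian_02.png", "A Dalmatian on a beach", 0.19),
    Generation("dalmatian_03.png", "A Dalmatian in the snow", 0.28),
]

kept = keep_confident(generations, threshold=0.28)
print([g.image_path for g in kept])
```

In practice the threshold would need tuning per dataset, since a stricter cutoff trades dataset size for label confidence.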

Limitations and mitigations

If not used carefully, DALL-E can generate inaccurate images or ones of a narrow scope, excluding specific ethnic groups or disregarding traits that might lead to bias. A simple example would be a face detector that was only trained on images of men. Moreover, using images generated by DALL-E might hold a significant risk in specific domains such as pathology or self-driving cars, where the cost of a false negative is extreme.

DALL-E 2 still has some limitations, with compositionality being one of them. Relying on prompts that, for example, assume the correct positioning of objects might be risky.

Caption: DALL-E still struggles with some prompts. Source: Twitter

Ways to mitigate this include human sampling, where a human expert randomly selects samples and checks their validity. To optimize such a process, one can follow an active-learning approach in which the images that received the lowest CLIP ranking for a given caption are prioritized for review.
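The prioritization idea can be sketched as a review queue that surfaces the lowest-scoring images first. The scores and the `budget` parameter (how many images a reviewer can check) are assumptions made for this example.

```python
import heapq

def review_queue(scored_images, budget):
    """Return up to `budget` image ids, lowest CLIP score first."""
    lowest = heapq.nsmallest(budget, scored_images.items(), key=lambda item: item[1])
    return [image_id for image_id, _ in lowest]

# Invented CLIP scores for images generated from one caption
scored_images = {
    "img_01.png": 0.31,
    "img_02.png": 0.12,
    "img_03.png": 0.25,
    "img_04.png": 0.18,
}

to_review = review_queue(scored_images, budget=2)
print(to_review)
```

Images that fall outside the review budget can still be covered by the random sampling described above.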

Final words

DALL-E 2 is yet another exciting research result from OpenAI that opens the door to new kinds of applications. Generating huge datasets to address one of computer vision’s biggest bottlenecks, data, is just one example.

OpenAI signals it will release DALL-E sometime during this upcoming summer, most likely in a phased release with a pre-screening for interested users. Those who can’t wait, or who are unable to pay for this service, can tinker with open source alternatives such as DALL-E Mini (Interface, Playground repository).

While the business case for many DALL-E-based applications will depend on the pricing and policy OpenAI sets for its API users, they are all certain to take image generation one big leap forward.

Sahar Mor has 13 years of engineering and product management experience focused on AI products. He is currently a Product Manager at Stripe, leading strategic data initiatives. Previously, he founded AirPaper, a document intelligence API powered by GPT-3, and was a founding Product Manager at Zeitgold (acquired by Deel), a B2B AI accounting software company where he built and scaled its human-in-the-loop product, and at Levity.ai, a no-code AutoML platform. He also worked as an engineering manager in early-stage startups and at the elite Israeli intelligence unit, 8200.
