Roughly a year ago, VentureBeat wrote about progress in the AI and machine learning field toward developing multimodal models, or models that can understand the meaning of text, videos, audio, and images together in context. Back then, the work was in its infancy and faced formidable challenges, not least of which concerned biases amplified in training datasets. But breakthroughs have been made.

This year, OpenAI released DALL-E and CLIP, two multimodal models that the research lab claims are “a step toward systems with [a] deeper understanding of the world.” DALL-E, inspired by the surrealist artist Salvador Dalí, was trained to generate images from simple text descriptions. Similarly, CLIP (for “Contrastive Language-Image Pre-training”) was trained to associate visual concepts with language, drawing on example photos paired with captions scraped from the public web.
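To make the “contrastive” part of CLIP’s name concrete, here is a minimal sketch of a symmetric contrastive objective over image-caption pairs. It is an illustration under stated assumptions, not OpenAI’s implementation: the toy two-dimensional embeddings, the function names, and the temperature value are all hypothetical, and a real system would produce the embeddings with trained image and text encoders.

```python
import math


def normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def similarity_logits(image_embs, text_embs, temperature):
    """Cosine similarity of every image with every caption, divided by temperature."""
    images = [normalize(v) for v in image_embs]
    texts = [normalize(v) for v in text_embs]
    return [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in texts]
        for img in images
    ]


def diagonal_cross_entropy(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        peak = max(row)  # subtract the max for numerical stability
        log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Average of the image-to-text and text-to-image cross-entropy terms.

    Assumes image_embs[i] and text_embs[i] come from the same photo/caption pair.
    The default temperature is an illustrative value, not a tuned one.
    """
    logits = similarity_logits(image_embs, text_embs, temperature)
    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (diagonal_cross_entropy(logits) + diagonal_cross_entropy(transposed))


if __name__ == "__main__":
    # Hypothetical embeddings: two photos and their two captions.
    photos = [[1.0, 0.0], [0.0, 1.0]]
    captions = [[0.9, 0.1], [0.1, 0.9]]
    print("matched pairs:   ", contrastive_loss(photos, captions))
    print("mismatched pairs:", contrastive_loss(photos, captions[::-1]))
```

The loss is low when each photo’s embedding sits closest to its own caption and high when the pairs are scrambled, which is the pressure that pushes a model of this kind to associate visual concepts with language.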

DALL-E and CLIP are only the tip of the iceberg. Several studies have demonstrated that a single model can be trained to learn the relationships between audio, text, images, and other forms of data. Some hurdles have yet to be overcome, like model bias. But multimodal models are already being put to real-world use, including in hate speech detection.

Promising new directions

Humans understand events in the world contextually, performing multimodal reasoning across time to make inferences about the past, present, and future. For example, given text and an image that seems innocuous when considered separately — e.g., “Look how many people love you” and a picture of a barren desert — people recognize that these elements take on potentially hurtful connotations when they’re paired or juxtaposed.



Above: Merlot can understand the sequence of events in videos, as demonstrated here.

Even the best AI systems struggle in this area. But those like the Allen Institute for Artificial Intelligence’s and the University of Washington’s Multimodal Neural Script Knowledge Models (Merlot) show how far the literature has come. Merlot, which was detailed in a paper published earlier in the year, learns to match images in videos with words and follow events over time by watching millions of transcribed YouTube videos. It does all this in an unsupervised manner, meaning the videos don’t need to be labeled or categorized — the system learns from the videos’ inherent structures.

“We hope that Merlot can inspire future work for learning vision plus language representations in a more human-like fashion compared to learning from literal captions and their corresponding images,” the coauthors wrote in a paper published last summer. “The model achieves strong performance on tasks requiring event-level reasoning over videos and static images.”

In this same vein, Google in June introduced MUM, a multimodal model trained on a dataset of documents from the web that can transfer knowledge between languages. MUM, which doesn’t need to be explicitly taught how to complete tasks, is able to answer questions in 75 languages, including “I want to hike to Mount Fuji next fall — what should I do to prepare?” while realizing that “prepare” could encompass things like fitness as well as weather.

A more recent project from Google, Video-Audio-Text Transformer (VATT), is an attempt to build a highly capable multimodal model by training across datasets containing video transcripts, videos, audio, and photos. VATT can make predictions for multiple modalities and datasets from raw signals, not only successfully captioning events in videos but pulling up videos given a prompt, categorizing audio clips, and recognizing objects in images.

“We wanted to examine if there exists one model that can learn semantic representations of different modalities and datasets at once (from raw multimodal signals),” Hassan Akbari, a research scientist at Google who codeveloped VATT, told VentureBeat via email. “At first, we didn’t expect it to even converge, because we were forcing one model to process different raw signals from different modalities. We observed that not only is it possible to train one model to do that, but its internal activations show interesting patterns. For example, some layers of the model specialize [in] a specific modality while skipping other modalities. Final layers of the model treat all modalities (semantically) the same and perceive them almost equally.”

For their part, researchers at Meta, formerly Facebook, claim to have created a multimodal model that achieves “impressive performance” on 35 different vision, language, and crossmodal and multimodal vision and language tasks. The creators note that the model, called FLAVA, was trained on a collection of openly available datasets (tens of millions of text-image pairs) roughly six times smaller than the datasets used to train CLIP, demonstrating its efficiency.

“Our work points the way forward towards generalized but open models that perform well on a wide variety of multimodal tasks” including image recognition and caption generation, the authors wrote in the academic paper introducing FLAVA. “Combining information from different modalities into one universal architecture holds promise not only because it is similar to how humans make sense of the world, but also because it may lead to better sample efficiency and much richer representations.”

Not to be outdone, a team of Microsoft Research Asia and Peking University researchers has developed NUWA, a model that they claim can generate new or edit existing images and videos for various media creation tasks. The researchers claim that NUWA, which was trained on text, video, and image datasets, can learn to spit out images or videos given a sketch or text prompt (e.g., “A dog with goggles is staring at the camera”), predict the next scene in a video from a few frames of footage, or automatically fill in the blanks in an image that’s partially obscured.


Above: NUWA can generate videos given a text prompt.

Image Credit: Microsoft

“[Previous techniques] treat images and videos separately and focus on generating either of them. This limits the models to benefit from both image and video data,” the researchers wrote in a paper. “NUWA shows surprisingly good zero-shot capabilities not only on text-guided image manipulation, but also text-guided video manipulation.”

The problem of bias

Multimodal models, like other types of models, are susceptible to bias, which often arises from the datasets used to train the models.

In a study out of the University of Southern California and Carnegie Mellon, researchers found that one open source multimodal model, VL-BERT, tends to stereotypically associate certain types of apparel, like aprons, with women. OpenAI has explored the presence of biases in multimodal neurons, the components that make up multimodal models, including a “terrorism/Islam” neuron that responds to images of words like “attack” and “horror” but also “Allah” and “Muslim.”

CLIP exhibits biases, as well, at times horrifyingly misclassifying images of Black people as “non-human” and teenagers as “criminals” and “thieves.” According to OpenAI, the model is also prejudicial toward certain genders, associating terms having to do with appearance (e.g., “brown hair,” “blonde”) and occupations like “nanny” with pictures of women.

As with CLIP, the Allen Institute and University of Washington researchers note that Merlot can exhibit undesirable biases because it was only trained on English data and largely local news segments, which can spend a lot of time covering crime stories in a sensationalized way. Studies have demonstrated a correlation between watching the local news and having more explicit, racialized beliefs about crime. It’s “very likely” that training models like Merlot on mostly news content could cause them to learn sexist patterns as well as racist patterns, the researchers concede, given that the most popular YouTubers in most countries are men.

In lieu of a technical solution, OpenAI recommends “community exploration” to better understand models like CLIP and develop evaluations to assess their capabilities — and potential for misuse (e.g., generating disinformation). This, they say, could help increase the likelihood multimodal models are used beneficially while shedding light on the performance gap between models.

Real-world applications

While some work remains firmly in the research phases, companies including Google and Facebook are actively commercializing multimodal models to improve their products and services.

For example, Google says it’ll use MUM to power a new feature in Google Lens, the company’s image recognition technology, that finds objects like apparel based on photos and high-level descriptions. Google also claims that MUM helped its engineers to identify more than 800 COVID-19 name variations in over 50 languages.

In the future, Google’s VP of Search Pandu Nayak says, MUM could connect users to businesses by surfacing products and reviews and improving “all kinds” of language understanding — whether at the customer service level or in a research setting. “MUM can understand that what you’re looking for are techniques for fixing and what that mechanism is,” he told VentureBeat in a previous interview. “The power of MUM is its ability to understand information on a broad level … This is the kind of thing that the multimodal [models] promise.”

Meta, meanwhile, reports that it’s using multimodal models to recognize whether memes violate its terms of service. The company recently built and deployed a system, Few-Shot Learner (FSL), that can adapt to take action on evolving types of potentially harmful content in upwards of 100 languages. Meta claims that, on Facebook, FSL has helped to identify content that shares misleading information in a way that would discourage COVID-19 vaccinations or that comes close to inciting violence.

Future multimodal models might have even farther-reaching implications.

Researchers at UCLA, the University of Southern California, Intuit, and the Chan Zuckerberg Initiative have released a dataset called Multimodal Biomedical Experiment Method Classification (Melinda) designed to see whether current multimodal models can curate biological studies as well as human reviewers. Curating studies is an important — yet labor-intensive — process performed by researchers in life sciences that requires recognizing experiment methods to identify the underlying protocols that net the figures published in research articles.

Even the best multimodal models available struggled on Melinda. But the researchers are hopeful that the benchmark motivates additional work in this area. “The Melinda dataset could serve as a good testbed for benchmarking [because] the recognition [task] is fundamentally multimodal [and challenging], where justification of the experiment methods takes both figures and captions into consideration,” they wrote in a paper.


Above: OpenAI’s DALL-E.

Image Credit: OpenAI

As for DALL-E, OpenAI predicts that it might someday augment — or even replace — 3D rendering engines. For example, architects could use the tool to visualize buildings, while graphic artists could apply it to software and video game design. In another point in DALL-E’s favor, the tool could combine disparate ideas to synthesize objects, some of which are unlikely to exist in the real world — like a hybrid of a snail and a harp.

Aditya Ramesh, a researcher working on the DALL-E team, told VentureBeat in an interview that OpenAI has been focusing for the past few months on improving the model’s core capabilities. The team is currently investigating ways to achieve higher image resolutions and photorealism, as well as ways that the next generation of DALL-E — which Ramesh referred to as “DALL-E v2” — could be used to edit photos and generate images more quickly.

A paper that Ramesh coauthored with fellow OpenAI researchers gives a glimpse into this future. It describes a multimodal model called Guided Language to Image Diffusion for Generation and Editing (GLIDE), which — like DALL-E — can create photos given a short text caption. But GLIDE can also be fine-tuned to edit existing images, for example swapping out a forest around a car for a tundra while matching the style and lighting of the original picture. In the paper, the coauthors show how GLIDE could be used to create a complex scene by generating an image (e.g., with the prompt “a cozy living room”); adding a painting to the wall, a coffee table, and a vase of flowers on the coffee table; and moving the wall up to the couch.


Above: OpenAI’s GLIDE can make edits to existing photos or generate new objects in photos given a text prompt.

Image Credit: OpenAI

“A lot of our effort has gone toward making these models deployable in practice and [the] sort of things we need to work on to make that possible,” Ramesh said. “We want to make sure that, if at some point these models are made available to a large audience, we do so in a way that’s safe.”

Far-reaching consequences

“DALL-E shows creativity, producing useful conceptual images for product, fashion, and interior design,” Gary Grossman, global lead at Edelman’s AI Center of Excellence, wrote in a recent opinion article. “DALL-E could support creative brainstorming … either with thought starters or, one day, producing final conceptual images. Time will tell whether this will replace people performing these tasks or simply be another tool to boost efficiency and creativity.”

It’s early days, but Grossman’s last point, that multimodal models might replace rather than augment humans, is likely to become increasingly relevant as the technology grows more sophisticated. (By 2022, an estimated 5 million jobs worldwide will be lost to automation technologies, with 47% of U.S. jobs at risk of being automated.) A related and equally unaddressed question is how organizations with fewer resources will be able to leverage multimodal models, given the models’ relatively high development costs.

Another unaddressed question is how to prevent multimodal models from being abused by malicious actors, from governments and criminals to cyberbullies. In a paper published by Stanford’s Institute for Human-Centered Artificial Intelligence (HAI), the coauthors argue that advances in multimodal models like DALL-E will result in higher-quality, machine-generated content that’ll be easier to personalize for “misuse purposes” — like publishing misleading articles targeted to different political parties, nationalities, and religions.

“[Multimodal models] could … impersonate speech, motions, or writing, and potentially be misused to embarrass, intimidate, and extort victims,” the coauthors wrote. “Generated deepfake images and misinformation pose greater risks as the semantic and generative capability of vision foundation models continues to grow.”

Ramesh says that OpenAI has been studying filtering methods that could, at least at the API level, be used to limit the sort of harmful content that models like DALL-E generate. It won’t be easy: unlike the filtering technologies that OpenAI implemented for its text-only GPT-3 model, DALL-E’s filters would have to be capable of detecting problematic elements in images and language that they hadn’t seen before. But Ramesh believes it’s “possible,” depending on which tradeoffs the lab decides to make.

“There’s a spectrum of possibilities for what we could do. For example, you could even filter all images of people out of the data, but then the model wouldn’t be very useful for a large number of applications — it probably wouldn’t know a lot about how the world works,” Ramesh said. “Thinking about the trade-offs there and how far to go so that the model is deployable, yet still useful, is something we’ve been putting a lot of effort into.”

Some experts argue that the inaccessibility of multimodal models threatens to stunt progress on this sort of filtering research. Ramesh conceded that, with generative models like DALL-E, the training process is “always going to be pretty long and relatively expensive,” especially if the goal is a single model with a diverse set of capabilities.

As the Stanford HAI paper reads: “[T]he actual training of [multimodal] models is unavailable to the vast majority of AI researchers, due to the much higher computational cost and the complex engineering requirements … The gap between the private models that industry can train and the ones that are open to the community will likely remain large if not grow … The fundamental centralizing nature of [multimodal] models means that the barrier to entry for developing them will continue to rise, so that even startups, despite their agility, will find it difficult to compete, a trend that is reflected in the development of search engines.”

But as the past year has shown, progress is marching forward — consequences be damned.
