The immense potential and challenges of multimodal AI

Unlike most AI systems, humans understand the meaning of text, videos, audio, and images together in context. For example, given text and an image that seem innocuous when considered apart (e.g., "Look how many people love you" and a picture of a barren desert), people recognize that these elements take on potentially hurtful connotations when they're paired or juxtaposed.

While systems capable of making these multimodal inferences remain beyond reach, there's been progress. New research over the past year has advanced the state of the art in multimodal learning, particularly in the subfield of visual question answering (VQA), a computer vision task where a system is given a text-based question about an image and must infer the answer. As it turns out, different modalities can carry complementary information or trends, which often become evident only when they're all included in the learning process. And this holds promise for applications from captioning to translating comic books into different languages.

Multimodal challenges

In multimodal systems, computer vision and natural language processing models are trained together on datasets to learn a combined embedding space, or a space occupied by variables representing specific features of the images, text, and other media. If different words are paired with similar images, those words are likely used to describe the same things or objects. Conversely, if certain words appear next to different images, this implies those images represent the same object. It should be possible, then, for a multimodal system to predict things like image objects from text descriptions, and a body of academic literature has shown this to be the case.
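To make the idea of a combined embedding space concrete, here is a minimal sketch in Python. It assumes the text and image encoders have already been trained, so the three-dimensional vectors, the caption strings, the image names, and the `retrieve_image` helper are all hypothetical stand-ins rather than anything from the research described here. It shows only the lookup step: a caption is matched to the image whose vector points in the most similar direction.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical, hand-picked embeddings. A real system learns these
# vectors by training vision and language models together.
IMAGE_EMBEDDINGS = {
    "dog_photo": [0.9, 0.1, 0.0],
    "car_photo": [0.0, 0.2, 0.9],
}
TEXT_EMBEDDINGS = {
    "a puppy on the grass": [0.8, 0.3, 0.1],
    "a red sedan": [0.1, 0.1, 0.8],
}


def retrieve_image(caption):
    """Return the name of the image closest to the caption in the shared space."""
    query = TEXT_EMBEDDINGS[caption]
    return max(
        IMAGE_EMBEDDINGS,
        key=lambda name: cosine(query, IMAGE_EMBEDDINGS[name]),
    )


print(retrieve_image("a puppy on the grass"))
print(retrieve_image("a red sedan"))
```

Because both media types live in the same space, the same similarity function could be run in the other direction to find the caption closest to a given image.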

There's just one problem: Multimodal systems notoriously pick up on biases in datasets. The diversity of questions and concepts involved in tasks like VQA, as well as the lack of high-quality data, often prevent models from learning to "reason," leading them to make educated guesses by relying on dataset statistics.

Key insights might lie in a benchmark test developed by scientists at Orange Labs and Institut National des Sciences Appliquées de Lyon. Claiming that the standard metric for measuring VQA model accuracy is misleading, they offer as an alternative GQA-OOD, which evaluates performance on questions whose answers can't be inferred without reasoning. In a study involving seven VQA models and three bias-reduction techniques, the researchers found that the models failed to address questions involving infrequent concepts, suggesting that there's work to be done in this area.
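A rough sketch of why a single accuracy number can mislead: the Python below scores predictions separately for frequent ("head") and infrequent ("tail") answers. This is not the GQA-OOD protocol itself. The `head_tail_accuracy` function, the `tail_share` threshold, and the toy data are assumptions chosen purely for illustration.

```python
from collections import Counter


def head_tail_accuracy(train_answers, eval_examples, tail_share=0.2):
    """Report accuracy separately for frequent and infrequent gold answers.

    train_answers: list of answers seen in training.
    eval_examples: list of (gold_answer, predicted_answer) pairs.
    tail_share: assumed cutoff; an answer is "tail" if it makes up less
        than this fraction of the training answers.
    """
    counts = Counter(train_answers)
    total = sum(counts.values())

    # bucket name -> [number correct, number seen]
    buckets = {"head": [0, 0], "tail": [0, 0]}
    for gold, pred in eval_examples:
        # Answers never seen in training have a count of 0, so they are "tail".
        name = "tail" if counts[gold] / total < tail_share else "head"
        buckets[name][0] += int(gold == pred)
        buckets[name][1] += 1

    return {
        name: (correct / seen if seen else None)
        for name, (correct, seen) in buckets.items()
    }


# Toy data: "yes" dominates training, "purple" is rare.
train = ["yes"] * 6 + ["no"] * 3 + ["purple"]
# A model that always guesses the most common answer.
evaluation = [("yes", "yes"), ("no", "yes"), ("purple", "yes")]
print(head_tail_accuracy(train, evaluation))
```

In this toy case, a model that always answers "yes" gets half of the head questions right and none of the tail questions, a gap that an overall average would blur.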

The solution will likely involve larger, more comprehensive training datasets. A paper published by engineers at École Normale Supérieure in Paris, Inria Paris, and the Czech Institute of Informatics, Robotics, and Cybernetics proposes a VQA dataset created from millions of narrated videos. Consisting of automatically generated pairs of questions and answers from transcribed videos, the dataset eliminates the need for manual annotation while enabling strong performance on popular benchmarks, according to the researchers. (Most machine learning models learn to make predictions from data labeled automatically or by hand.)

In tandem with better datasets, new training techniques might also help to boost multimodal system performance. Earlier this year, researchers at Microsoft and the University of Rochester coauthored a paper describing a pipeline aimed at improving the reading and understanding of text in images for question answering and image caption generation. In contrast with conventional vision-language pretraining, which often fails to capture text and its relationship with visuals, their approach incorporates text generated from optical character recognition engines during the pretraining process.

Three pretraining tasks and a dataset of 1.4 million image-text pairs help VQA models learn a better-aligned representation between words and objects, according to the researchers. "We find it particularly important to include the detected scene text words as extra language inputs," they wrote. "The extra scene text modality, together with the specially designed pre-training steps, effectively helps the model learn a better aligned representation among the three modalities: text word, visual object, and scene text."

Beyond pure VQA systems, promising approaches are emerging in the dialogue-driven multimodal domain. Researchers at Facebook, the Allen Institute for AI, SRI International, Oregon State University, and the Georgia Institute of Technology propose "dialog without dialog," a challenge that requires visually grounded dialogue models to adapt to new tasks while not forgetting how to talk with people. For its part, Facebook recently introduced Situated Interactive MultiModal Conversations, a research direction aimed at training AI chatbots that take actions like showing an object and explaining what it’s made of in response to images, memories of previous interactions, and individual requests.

Real-world applications

Assuming the barriers in the way of performant multimodal systems are eventually overcome, what are the real-world applications?

With its visual dialogue system, Facebook would appear to be pursuing a digital assistant that emulates human partners by responding to images, messages, and messages about images as naturally as a person might. For example, given the prompt "I want to buy some chairs -- show me brown ones and tell me about the materials," the assistant might reply with an image of brown chairs and the text "How do you like these? They have a solid brown color with a foam fitting."

Separately, Facebook is working toward a system that can automatically detect hateful memes on its platform. In May, it launched the Hateful Memes Challenge, a competition aimed at spurring researchers to develop systems that can identify memes intended to hurt people. The first phase of the one-year contest recently crossed the halfway mark with over 3,000 entries from hundreds of teams around the world.

At Microsoft, a handful of researchers are focused on the task of applying multimodal systems to video captioning. A team hailing from Microsoft Research Asia and Harbin Institute of Technology created a system that learns to capture representations among comments, video, and audio, enabling it to supply captions or comments relevant to scenes in videos. In a separate work, Microsoft coauthors detailed a model -- Multitask Multilingual Multimodal Pretrained model -- that learns universal representations of objects expressed in different languages, allowing it to achieve state-of-the-art results in tasks including multilingual image captioning.

Meanwhile, researchers at Google recently tackled the problem of predicting next lines of dialogue in a video. They claim that with a dataset of instructional videos scraped from the web, they were able to train a multimodal system to anticipate what a narrator would say next. For example, given frames from a scene and the transcript "I'm going to go ahead and slip that into place and I'm going to make note ... of which way the arrow is going in relation to the arrow on our guard. They both need to be going the same direction next," the model could correctly predict "Now slip that nut back on and screw it down" as the next phrase.

"Imagine that you are cooking an elaborate meal, but forget the next step in the recipe -- or fixing your car and uncertain about which tool to pick up next," the coauthors of the Google study wrote. "Developing an intelligent dialogue system that not only emulates human conversation, but also predicts and suggests future actions -- not to mention is able to answer questions on complex tasks and topics -- has long been a moonshot goal for the AI community. Conversational AI allows humans to interact with systems in free-form natural language."

Another fascinating study proposes using multimodal systems to translate manga, a form of Japanese comic, into other languages. Scientists at Yahoo! Japan, the University of Tokyo, and machine translation startup Mantra prototyped a system that translates texts in speech bubbles that can't be translated without context information (e.g., texts in other speech bubbles, the gender of speakers). Given a manga page, the system automatically translates the texts on the page into English and replaces the original texts with the translated ones.

Future work

At VentureBeat's Transform 2020 conference, as part of a conversation about trends for AI assistants, Prem Natarajan, Amazon head of product and VP of Alexa AI and NLP, and Barak Turovsky, Google AI director of product for the NLU team, agreed that research into multimodality will be of critical importance going forward. Turovsky talked about advances in surfacing the limited number of answers voice alone can offer. Without a screen, he pointed out, there's no infinite scroll or first page of Google search results, and so responses should be limited to three potential results, tops. For both Amazon and Google, this means building smart displays and emphasizing AI assistants that can both share visual content and respond with voice.

Turovsky and Natarajan aren't the only ones who see a future in multimodality, despite its challenges. OpenAI is reportedly developing a multimodal system trained on images, text, and other data using massive computational resources, an approach the company's leadership believes is the most promising path toward AGI, or AI that can learn any task a human can. And in a conversation with VentureBeat in January, Google AI chief Jeff Dean predicted progress in multimodal systems in the years ahead. The advancement of multimodal systems could lead to a number of benefits for image recognition and language models, he said, including more robust inference from models receiving input from more than a single medium.

"That whole research thread, I think, has been quite fruitful in terms of actually yielding machine learning models that [let us now] do more sophisticated NLP tasks than we used to be able to do," Dean told VentureBeat. "[But] we'd still like to be able to do much more contextual kinds of models."
