Experimental AI framework Vx2Text generates video captions using inferences from audio and text

A grand challenge in AI is developing a conversational system that can reliably understand the world and respond using natural language. Ultimately, solving it will require a model capable of extracting salient information from images, text, audio, and video and answering questions in a way that humans can understand. In a step toward this, researchers at Facebook, Columbia University, Georgia Tech, and Dartmouth developed Vx2Text, a framework for generating text from videos, speech, or audio. They claim that Vx2Text can create captions and answer questions better than previous state-of-the-art approaches.

Unlike most AI systems, humans understand the meaning of text, videos, audio, and images together in context. For example, given text and an image that seem innocuous when considered apart (e.g., "Look how many people love you" and a picture of a barren desert), people recognize that these elements take on potentially hurtful connotations when they're paired or juxtaposed. Different modalities can carry complementary information or trends, which often only become evident when they're all included in the learning process. This holds promise for applications ranging from transcription to translating comic books into different languages.

In the case of Vx2Text, "modality-specific" classifiers convert semantic signals from videos, text, or audio into a common semantic language space. This enables language models to directly interpret multimodal data, opening up the possibility of carrying out multimodal fusion -- i.e., combining signals to bolster classification -- by means of powerful language models like Google's T5. A generative text decoder within Vx2Text transforms multimodal features computed by an encoder into text, making the framework suitable for generating natural language responses.
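To make the idea of a common language space concrete, here is a minimal sketch in Python. It is not the researchers' implementation: the category names, the scores, and the helper functions `top_k_categories` and `to_language_space` are all hypothetical, and the classifier outputs are hard-coded rather than produced by trained models. It only illustrates how per-modality predictions could be rendered as words and joined with ordinary text so that a language model sees a single textual input.

```python
from typing import Dict, List


def top_k_categories(scores: Dict[str, float], k: int) -> List[str]:
    """Return the k category names with the highest scores (hypothetical helper)."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [name for name, _ in ranked[:k]]


def to_language_space(
    modality_scores: Dict[str, Dict[str, float]],
    text_inputs: List[str],
    k: int = 2,
    sep: str = " | ",
) -> str:
    """Render each modality's top categories as words and join them with the text inputs."""
    segments = list(text_inputs)
    for modality, scores in modality_scores.items():
        words = " ".join(top_k_categories(scores, k))
        segments.append(f"{modality}: {words}")
    return sep.join(segments)


# Made-up classifier outputs; a real system would compute these from the video and audio.
video_scores = {"answering phone": 0.61, "helping someone up": 0.27, "cooking": 0.12}
audio_scores = {"telephone ringing": 0.70, "speech": 0.25, "music": 0.05}

prompt = to_language_space(
    {"video": video_scores, "audio": audio_scores},
    ["question: what is the person doing?"],
    k=2,
)
print(prompt)
```

In a full system, a string like `prompt` would be tokenized and passed to an encoder-decoder language model that generates the answer or caption. Note that a hard top-k selection like the one above is not differentiable, so this sketch says nothing about how the authors keep the model trainable end-to-end.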

"Not only is such a design much simpler but it also leads to better performance compared to prior approaches," the researchers wrote in a paper describing their work. Helpfully, it also does away with the need to design specialized algorithms or resort to alternative approaches to combine the signals, they added.

In experiments, the researchers show that Vx2Text generates "realistic" natural text for both audio-visual "scene-aware" dialog and video captioning. Although the researchers provided the model with context in the form of dialog histories and speech transcripts, they note that the generated text includes information from non-text modalities, such as references to actions like helping someone get up or answering a telephone.

Vx2Text has applications in the enterprise, where it could be used to caption recorded or streamed videos for accessibility purposes. Alternatively, the framework (or something like it) could find its way into video sharing platforms like YouTube and Vimeo, which rely on captioning, among other signals, to improve the relevancy of search results.

"Our approach hinges on the idea of mapping all modalities into a semantic language space in order to enable the direct application of transformer networks, which have been shown to be highly effective at modeling language problems," the researchers wrote. "This renders our entire model trainable end-to-end."
