Generating live video comments with AI could bolster engagement on social media, or serve as a benchmark for the task of translating video to text. Preliminary work to this end has employed encoder-decoder models to generate comments, but those models haven’t explicitly modeled the interaction between videos and comments, so they’ve tended to generate irrelevant ones.

That’s why a team of Microsoft Research Asia and Harbin Institute of Technology researchers proposes a new technique in a preprint paper published on arXiv.org. Their model iteratively learns representations that span comments, video, and audio, and they say that in experiments, it outperforms state-of-the-art methods.

The system, the code for which is available on GitHub, matches videos with the most relevant comments from a candidate set, which lets it jointly learn cross-modal representations. It’s based on Google’s Transformer architecture, which like all neural networks contains functions (neurons) arranged in layers that transmit signals from data and slowly adjust the connections’ strength (weights). Transformers are built around attention, which means that every output element is connected to every input element, and the weightings between them are calculated dynamically.
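To make the attention idea concrete, here is a minimal sketch of scaled dot-product attention in plain Python. It is an illustration of the general mechanism, not the researchers’ code: the function names and the toy two-dimensional inputs are assumptions made for this example.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors.

    Returns one output vector per query, plus the weight each query
    assigned to every input position.
    """
    dim = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Score this query against every key, so each output
        # element is connected to every input element.
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim)
            for k in keys
        ]
        weights = softmax(scores)  # computed dynamically from the data
        out = [
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights


# Toy input (hypothetical): three positions, two features each.
inputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = attention(inputs, inputs, inputs)
for row in weights:
    print([round(w, 3) for w in row])
```

Each printed row shows how much one position draws on every other position. The weights depend on the inputs themselves rather than being fixed ahead of time, which is what “calculated dynamically” refers to.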


Above: An example of the comments produced by different models for a video. At the top are three selected frames from the video. Below them are the video’s existing comments and the comments produced by each model. (The researchers’ model is the Matching Transformer.)

Image Credit: Microsoft

Concretely, the automatic live commenting system consists of three components: an encoder layer that converts different modalities of a video and a candidate comment into vectors (i.e., mathematical representations); a matching layer that learns the representation for each modality; and a prediction layer that outputs a score measuring the matching degree between a video clip and a comment. Given a video and a time-stamp, the model aims to select the comment from a candidate set that is most relevant to the video clip near the time-stamp, based on the surrounding comments, the visual part, and the audio part. Surrounding comments are extracted near the time-stamp, and for the visual part, the system samples video frames near the time-stamp.
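The selection step can be sketched roughly as follows. This is a deliberately simplified stand-in, not the Matching Transformer itself: it assumes each modality has already been encoded into a vector, replaces the learned matching layer with a plain average, and uses cosine similarity as the prediction score. All names, vectors, and comment texts below are hypothetical.

```python
import math


def mean_vector(vectors):
    """Average several equal-length vectors into one."""
    count = len(vectors)
    return [sum(v[i] for v in vectors) / count for i in range(len(vectors[0]))]


def cosine(a, b):
    """Cosine similarity, used here as a stand-in matching score."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def score_candidate(clip, candidate_vec):
    """Score how well one candidate comment matches a clip.

    Assumption: the clip's three modalities are fused by averaging,
    which stands in for the learned matching layer.
    """
    context = mean_vector([clip["comments"], clip["vision"], clip["audio"]])
    return cosine(context, candidate_vec)


def select_comment(clip, candidates):
    """Return the candidate comment with the highest matching score."""
    return max(candidates, key=lambda text: score_candidate(clip, candidates[text]))


# Hypothetical pre-encoded vectors for the clip near one time-stamp.
clip = {
    "comments": [1.0, 0.0, 0.0],  # surrounding comments
    "vision": [0.8, 0.2, 0.0],    # sampled frames
    "audio": [0.9, 0.0, 0.1],     # audio track
}
# Hypothetical candidate set with pre-encoded comment vectors.
candidates = {
    "those dumplings look great": [1.0, 0.1, 0.0],
    "what a goal": [0.0, 1.0, 0.0],
    "love this song": [0.0, 0.0, 1.0],
}
best = select_comment(clip, candidates)
print(best)
```

In the system the article describes, both the encoding and the matching are learned; the sketch only shows the overall shape of the task, which is to score every candidate against the clip and keep the top one.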

The researchers evaluated the system on a video-comment data set containing 2,361 videos and 895,929 comments, collected from the Chinese video streaming platform Bilibili. They also constructed a candidate comment set in which each video clip had 100 comments, comprising the ground-truth comments, the top 20 popular comments, and randomly selected comments.
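One plausible way to assemble such a candidate set is sketched below. The article gives only the composition (ground-truth comments, the top 20 popular comments, and random comments, 100 in total), so the ordering, the deduplication, the fixed random seed, and the function and variable names are all assumptions made for illustration.

```python
import random


def build_candidate_set(ground_truth, popular, pool, size=100, top_k=20, seed=0):
    """Assemble a candidate set of `size` distinct comments for one clip.

    Assumed order of filling: ground-truth comments first, then the
    `top_k` most popular comments, then random comments from `pool`.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    candidates, seen = [], set()

    def add(comment):
        # Skip duplicates and stop once the set is full.
        if comment not in seen and len(candidates) < size:
            seen.add(comment)
            candidates.append(comment)

    for comment in ground_truth:
        add(comment)
    for comment in popular[:top_k]:
        add(comment)

    remaining = [c for c in pool if c not in seen]
    rng.shuffle(remaining)
    for comment in remaining:
        add(comment)
    return candidates


# Hypothetical placeholder comments, not data from the paper.
ground_truth = [f"gt-{i}" for i in range(5)]
popular = [f"pop-{i}" for i in range(30)]
pool = [f"rand-{i}" for i in range(500)]

candidates = build_candidate_set(ground_truth, popular, pool)
print(len(candidates))
```

With a set built this way, a model can be judged on how highly it ranks the ground-truth comments among the 100 candidates.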

According to the team, the model outperformed several baselines on a number of measures, including relevance and correctness. In a clip featuring a soup dumpling, for example, it made comments about the dumplings exactly at key points in the video clip. “[W]e believe the multimodal pre-training will be a promising direction to explore, where tasks like image captioning and video captioning will benefit from pre-trained models,” wrote the researchers. “For future research, we will further investigate the multimodal interactions among vision, audio, and text in … real-world applications.”