It could be argued that videos are the lifeblood of social media — other than selfies, of course. Facebook clips alone average more than 8 billion views and 100 million hours of watch time a day, and more than 45 percent of people say they watch over an hour of Facebook or YouTube videos a week.
The trouble with video, though, is that it’s exclusionary — folks with disabilities or poor internet connectivity can’t easily participate. It’s with that in mind that researchers at Facebook created VideoStory, a new dataset of video descriptions intended to help train systems that can “automatically tell stories.”
They describe it in a new paper (“A Dataset for Telling the Stories of Social Media Videos”) published ahead of the Conference on Empirical Methods in Natural Language Processing (EMNLP) in Brussels, Belgium.
“Video content on social media platforms constitutes a major part of the communication between people, as it allows everyone to share their stories,” the researchers wrote. “However, if someone is unable to consume video … this severely limits their … communication. Automatically telling the stories using multi-sentence descriptions of videos would allow bridging this gap.”
To compile the dataset of 20,000 videos and 123,000 descriptive sentences, the team set out to find videos with “high engagement” on social media — i.e., popular uploads with a large number of comments and shares that prompted interactions between people.
The challenge was integrating information from each video into detailed captions describing the sequence of events. As the paper’s authors noted, existing datasets such as Stanford’s ActivityNet Captions focus on sets of preselected human activities, whereas social media videos cover a wide range of topics and categories.
For each of the videos, which ranged in length from 20 to 180 seconds, the team supplied annotated paragraphs describing objects, situations, and important details, and lined up the sentences with corresponding timestamps. In the end, clips averaged about five sentences, each aligned to roughly 18 seconds of footage.
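The alignment described above can be pictured as one record per clip: a list of sentences, each tied to a start and end time within the video. The field names below are illustrative, not the dataset’s actual schema:

```python
from dataclasses import dataclass, field

@dataclass
class CaptionSegment:
    start_sec: float  # where this sentence's span begins in the clip
    end_sec: float    # where it ends
    sentence: str     # one descriptive sentence for this span

@dataclass
class VideoStoryClip:
    video_id: str
    duration_sec: float                                   # clips run 20-180 seconds
    segments: list = field(default_factory=list)

    def add_sentence(self, start: float, end: float, text: str) -> None:
        # Timestamps must stay inside the clip and in order.
        assert 0 <= start < end <= self.duration_sec
        self.segments.append(CaptionSegment(start, end, text))

# Example: a 90-second clip annotated with five sentences (~18 s each),
# matching the averages reported for the dataset.
clip = VideoStoryClip("v123", 90.0)
for i in range(5):
    clip.add_sentence(i * 18.0, (i + 1) * 18.0, f"Sentence {i + 1} of the story.")
```

A structure like this makes the multi-sentence nature of the task explicit: unlike single-caption datasets, each clip carries an ordered sequence of time-grounded sentences.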
The importance of context
The next step was training an AI system that would use VideoStory to caption videos automatically. A total of 17,098 videos were reserved for training, and 999 and 1,011 were set aside for validation and testing, respectively.
First, the team used a recurrent neural network — an architecture commonly employed in natural language processing — to describe each segment of a given video. Then, to ensure the overall system took into account correlations between past and future events, they incorporated context from each previous segment description with a second machine learning model.
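The data flow — caption a segment, then feed a summary of that caption forward as context for the next one — can be sketched in miniature. The functions below are stand-ins for the paper’s learned encoder and RNN decoder, not Facebook’s actual models; only the loop structure is the point:

```python
def encode_segment(frames):
    # Stand-in for a learned video encoder: reduce a segment's frames
    # to a single feature value (a real model would emit a vector).
    return sum(frames) / len(frames)

def decode_caption(segment_feature, context_feature):
    # Stand-in for the RNN decoder: the caption is conditioned on BOTH
    # the current segment and the previous description's context.
    return f"caption(feat={segment_feature:.2f}, ctx={context_feature:.2f})"

def encode_caption(caption):
    # Stand-in for the second model that summarizes the previous
    # description into context for the next segment.
    return float(len(caption))

def tell_story(video_segments):
    story, context = [], 0.0  # no context exists before the first segment
    for frames in video_segments:
        feature = encode_segment(frames)
        sentence = decode_caption(feature, context)
        story.append(sentence)
        context = encode_caption(sentence)  # carried into the next step
    return story

# Three toy "segments" of frame features produce three linked sentences.
story = tell_story([[0.1, 0.2], [0.4, 0.6], [0.9, 1.1]])
```

The design choice this illustrates is the one the researchers credit for the improvement: each sentence is generated with knowledge of what was already said, rather than as an independent caption.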
The captions it generated weren’t consistently right — a video of a baby playing outside with two dogs yielded captions like, “A dog is standing in the middle of a house,” and “The dog runs around the room and the dog jumps up and down” — but the results demonstrated that the model, trained on the VideoStory dataset, benefited from the addition of contextual information.
“High-quality video descriptions are more than bags of single-sentence captions; they should tell a coherent story,” they wrote. “[Our] evaluations show that our dataset is complementary to prior work due to more diverse topics and the selection of engaging videos which tell a story. Our VideoStory dataset can serve as a good benchmark to build models for story understanding and multi-sentence video description.”