User-generated content (UGC) could be a godsend for marketers. It promises to cut down on the roughly $10 billion spent on content in the U.S. alone, an estimated $1 billion of which is lost to waste due to inefficiencies. Moreover, there’s evidence to suggest it yields a better ROI than in-house media. According to Adweek, 64 percent of social media users seek UGC before making a purchasing decision, and UGC videos receive ten times more views than branded videos.

There’s a problem with UGC, though, and it’s a big one: Marketers often have to spend hours sifting through submissions to find relevant, repurposable clips that fit a given theme. And it’s not getting any easier — last year, YouTube users uploaded 300 hours of videos every minute, and Cisco predicts that video will account for 82 percent of all web traffic by 2021.

That’s why Adobe is tapping artificial intelligence (AI) to expedite the process. It today introduced Smart Tags for video, a feature of Experience Manager (AEM) — the San Jose company’s content management solution for building websites, mobile apps, and forms — that automatically generates tags for the hundreds of thousands of UGC clips contributed each month.

Smart Tags for video is now available in beta for a select group of participants interested in enterprise use cases.

“Over the past two years, we’ve … invested [in] a lot of the really high-end computer vision models [Adobe’s] research teams have come forward [with] and are basically using that to automate the curation process,” Santiago Pombo, product manager of AEM, told VentureBeat in a phone interview.

Adobe Smart Tags

Smart Tags for video — which Adobe Research and Adobe’s Search team architected jointly using Adobe’s Sensei machine learning platform — produces two sets of tags for each clip. One describes roughly 150,000 classes of objects, scenes, and attributes, and the second corresponds to actions such as drinking, running, and jogging.

Smart Tags for video’s underlying tech builds on AEM’s image auto-tagger, trained on a collection of images from Adobe Stock. The system ingests individual frames in the target video to produce the first set of tags. The second set is the product of a tagging algorithm trained on curated “action-rich” videos with accompanying labels, scraped from the metadata of an internal Adobe video dataset. It’s applied to multiple frames in the video, and the results are aggregated to yield the final action tag set.
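Adobe hasn’t published its implementation, but the aggregation step described above can be sketched roughly as follows — the function name, the averaging scheme, and the example labels are all illustrative assumptions, not Adobe’s actual pipeline:

```python
from collections import defaultdict

def aggregate_action_tags(frame_predictions, top_k=3):
    """Combine per-frame action scores into one tag set for the clip.

    frame_predictions: one dict per sampled frame, mapping an action
    label to a confidence in [0, 1]. Averaging across frames is one
    simple aggregation; the article doesn't specify Adobe's scheme.
    """
    totals = defaultdict(float)
    for preds in frame_predictions:
        for label, score in preds.items():
            totals[label] += score
    n = len(frame_predictions)
    averaged = {label: total / n for label, total in totals.items()}
    # Keep the k highest-scoring actions as the clip's action tags.
    return sorted(averaged.items(), key=lambda kv: kv[1], reverse=True)[:top_k]

frames = [
    {"running": 0.9, "jogging": 0.7, "drinking": 0.1},
    {"running": 0.8, "jogging": 0.6},
    {"running": 0.85, "drinking": 0.2},
]
print(aggregate_action_tags(frames, top_k=2))
```

Averaging rewards actions that appear consistently across the clip, which is why a brief false positive in a single frame tends not to survive aggregation.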

A score from zero to 100 accompanies each tag, an estimate of the accuracy of the system’s prediction. AEM customers can mark tags the system doesn’t get quite right, which removes them from the search index and produces a record of the disassociation. A log of incorrectly tagged assets is sent to annotators as feedback.
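The correction flow described here — rejecting a tag drops it from search and records the disassociation for annotators — can be sketched minimally like so (the class and method names are hypothetical, not AEM’s API):

```python
class TagIndex:
    """Minimal sketch of a tag-correction flow: rejecting a tag removes
    it from the search index and records the disassociation so
    annotators can review it later."""

    def __init__(self):
        self.index = {}         # tag -> set of asset ids
        self.feedback_log = []  # (asset_id, tag) pairs for annotators

    def add(self, asset_id, tag):
        self.index.setdefault(tag, set()).add(asset_id)

    def reject(self, asset_id, tag):
        # Remove the bad tag from the search index...
        self.index.get(tag, set()).discard(asset_id)
        # ...and log the disassociation as annotator feedback.
        self.feedback_log.append((asset_id, tag))

idx = TagIndex()
idx.add("clip-42", "jogging")
idx.reject("clip-42", "jogging")
print(idx.index["jogging"], idx.feedback_log)
```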

What’s truly novel about Smart Tags for video, Pombo said, is that it enables users to create search rules and filters based on an asset’s content, rather than on manual tags and descriptions alone. Additionally, it allows them to specify a minimum confidence threshold for a specific tag or set of tags, ensuring a relevant selection of assets.
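A confidence-threshold filter of the kind described is straightforward to picture — this sketch assumes a simple tag-to-score mapping per asset, which is an illustration rather than AEM’s actual data model:

```python
def search(assets, required_tags, min_confidence=70):
    """Return ids of assets whose tags include every required tag at or
    above min_confidence (0-100 scale, as the article describes)."""
    results = []
    for asset in assets:
        tags = asset["tags"]  # tag -> confidence score
        if all(tags.get(t, 0) >= min_confidence for t in required_tags):
            results.append(asset["id"])
    return results

assets = [
    {"id": "clip-1", "tags": {"beach": 92, "running": 88}},
    {"id": "clip-2", "tags": {"beach": 55, "running": 90}},
]
print(search(assets, ["beach", "running"], min_confidence=70))  # → ['clip-1']
```

Raising the threshold trades recall for precision: the second clip is dropped because its “beach” score falls below the cutoff, even though its “running” score clears it.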

“These tools were put in place to [help] cut the signal from noise,” Pombo said. “The quality of the results … is much higher.”

Engineering the AI system wasn’t a walk in the park. Collectively, AEM customers perform ten search queries per second on average, which posed a significant latency challenge. And the Adobe Research team had to design an annotation pipeline that could handle the sizeable volume of UGC coming in.

“On the application side, we made our timeout errors a little bit more liberal than they were beforehand to give a bit more slack for the classification. [And we partnered] very closely with the R&D team to … do optimizations to do better and more efficient frame selection to have a better representation,” Pombo said. “We also have … [an] interesting … infrastructure or architecture design [that allows us to] basically perform a lot of the tasks in parallel.”

The result of all that hard work? Smart Tags for video can process videos in four seconds or less.

Future work will focus on expanding the volume of videos the system can recognize, Pombo said. The current iteration classifies clips up to 60 seconds in length.

“When we were like measuring trade-offs, we figure[d] that we were going to optimize for [an] 80 percent use case … but I do think the next step is to … increase it to 10 minutes,” he said.