Researchers at AI startup SenseTime, Amazon, and the Chinese University of Hong Kong say they’ve developed a new framework for leveraging web data — OmniSource — that notches records in the video recognition domain. By overcoming barriers between data formats like images, short videos, and long untrimmed videos and adopting good practices like data balancing, it’s ostensibly able to classify videos more accurately than state-of-the-art models while using up to 100 times less data.

In the future, OmniSource could be applied to security cameras within private and public places. Or it could inform the design of the moderation algorithms used on networks like Facebook.

As the researchers note, collecting the data required to train classification algorithms is costly and time-consuming. Because videos often contain multiple shots with one or more subjects, they must be watched in their entirety, manually cut into clips, and carefully annotated.


Above: A diagram of OmniSource’s architecture.

OmniSource, then, exploits web data of various forms (e.g., images, trimmed videos, and untrimmed videos) from multiple sources (search engines, social media) in an integrated way. An AI system filters out low-quality data samples and labels those that pass muster (70% to 80% on average), transforming each to suit the target task while improving the robustness of the classification model’s training.
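A minimal sketch of that filtering step, assuming a pre-trained "teacher" classifier has already scored each crawled sample: keep only the samples whose top predicted class matches the queried class with high enough confidence, and assign that class as the pseudo-label. The function name, the 0.7 threshold, and the toy probabilities are all illustrative, not taken from the paper.

```python
def filter_web_samples(samples_with_probs, query_class, threshold=0.7):
    """Keep samples whose top predicted class matches the query class
    with probability >= threshold; assign the query class as pseudo-label."""
    kept = []
    for sample, probs in samples_with_probs:
        top = max(range(len(probs)), key=lambda i: probs[i])
        if top == query_class and probs[top] >= threshold:
            kept.append((sample, query_class))
    return kept

# Example: three crawled clips with hypothetical teacher probabilities
# over 3 classes; we are collecting data for class 1.
crawled = [
    ("clip_a", [0.10, 0.85, 0.05]),  # confident, correct class -> kept
    ("clip_b", [0.60, 0.30, 0.10]),  # top class != query -> dropped
    ("clip_c", [0.20, 0.50, 0.30]),  # right class, low confidence -> dropped
]
print(filter_web_samples(crawled, query_class=1))
# -> [('clip_a', 1)]
```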

When given a recognition task, OmniSource obtains keywords for each class name in the taxonomy and crawls web data from the aforementioned sources, automatically discarding any duplicate data. For static images, to prep them for use during joint training, it generates “pseudo” videos by simulating a moving camera panning over each image.
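One simple way to realize such a pseudo-video, sketched here under the assumption that "moving camera" means a crop window panning across the still image (the paper's exact transformation may differ):

```python
import numpy as np

def image_to_pseudo_video(image, num_frames=8, crop_frac=0.8):
    """Turn a still image of shape (H, W, C) into a short clip by sliding
    a fixed-size crop window from top-left to bottom-right, imitating
    camera motion. Returns an array of shape (num_frames, ch, cw, C)."""
    h, w = image.shape[:2]
    ch, cw = int(h * crop_frac), int(w * crop_frac)
    frames = []
    for t in range(num_frames):
        alpha = t / max(num_frames - 1, 1)   # 0 -> 1 over the clip
        y = int(alpha * (h - ch))            # pan vertically
        x = int(alpha * (w - cw))            # pan horizontally
        frames.append(image[y:y + ch, x:x + cw])
    return np.stack(frames)

clip = image_to_pseudo_video(np.zeros((100, 100, 3), dtype=np.uint8))
print(clip.shape)  # -> (8, 80, 80, 3)
```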

In the joint training phase, once the data has been filtered and transformed into the same format as the target data set, OmniSource balances the web and target corpora and employs a cross-data set mixup strategy, where blended pairs of examples and their labels are used for training. (The researchers report that cross-data mixup works well when the video recognition models are trained from scratch, albeit less well for fine-tuning.)
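The mixing itself can be sketched as standard mixup applied across the two corpora: blend a target clip with a web clip using a Beta-distributed weight, and blend their one-hot labels the same way. This follows the generic mixup recipe; the coefficients and exact scheme OmniSource uses may differ.

```python
import numpy as np

def cross_dataset_mixup(x_target, y_target, x_web, y_web, alpha=0.2, seed=0):
    """Blend one target sample with one web sample (inputs and one-hot
    labels alike) using a Beta(alpha, alpha)-distributed weight lam."""
    rng = np.random.default_rng(seed)
    lam = rng.beta(alpha, alpha)             # mixing coefficient in [0, 1]
    x = lam * x_target + (1 - lam) * x_web   # blended inputs
    y = lam * y_target + (1 - lam) * y_web   # blended (soft) labels
    return x, y

# Toy example: two "clips" as 2-frame, 1-pixel tensors with one-hot labels.
xt, yt = np.ones((2, 1)), np.array([1.0, 0.0])
xw, yw = np.zeros((2, 1)), np.array([0.0, 1.0])
x_mix, y_mix = cross_dataset_mixup(xt, yt, xw, yw)
```

The blended label stays a valid distribution (it sums to 1), so the usual cross-entropy loss applies unchanged.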

In tests, the team used three target data sets:

  • Kinetics-400, which contains 400 classes with at least 400 10-second video clips each
  • YouTube-car, which contains thousands of videos showcasing 196 different types of cars
  • UCF101, a video recognition data set with 13,320 clips spanning 101 classes

With respect to the web sources, they collected over 2 million images from Google Image Search, over 1.5 million images and 500,000 videos from Instagram, and over 17,000 videos from YouTube. In conjunction with the target data sets, all of these were fed into several video classification models.

The team reports that with only 3.5 million images and 800,000 minutes of video crawled from the internet without human labeling — less than 2% of the data used by prior works — the trained models exhibited at least a 3.0% accuracy improvement benchmarked against the Kinetics-400 data set, reaching an accuracy of 83.6%. Meanwhile, their best trained-from-scratch model achieved 80.4% accuracy on Kinetics-400.

“Our framework can achieve comparable or better performance with a much simpler (also lighter) backbone design and smaller input size [than state of the art techniques],” wrote the coauthors of a paper describing OmniSource. “[It] leverages task-specific data collection and is more data-efficient, which greatly reduces the amount of data required … over previous methods. [Moreover, the] framework is generalizable to various video tasks such as video recognition and fine-grained categorization.”
