Google opens AVA dataset to help machines identify human actions in videos

Computer vision is emerging as a major boon for tech companies looking to get machines to perform tasks that, until now, only humans could manage.

In the past few months alone, eBay has revealed big plans to roll out a new search feature that lets you use existing photos to find similar items, while online clothing retailer ASOS announced something similar in the fashion realm. Taking things to the next level, Shutterstock last week unveiled a neat new experimental feature that allows users to search for stock photos based on their spatial composition, and a few days back Google's Photos app gained a new image recognition feature for pets.

Put simply, things are getting pretty exciting in the field of computer vision, and we're starting to see results from the growing investment across the AI sphere.

Video gaga

Many of the computer vision developments that have already made it into actual products involve static image-based applications, but we're beginning to see the fruits of computer vision technology in video, too. Russian authorities deployed facial recognition smarts across the country's CCTV network, for example. Pornhub is doing something similar to automatically categorize "adult entertainment" videos, including training the system to recognize specific sexual positions. Then there is the burgeoning autonomous vehicle industry that leans heavily on machines' ability to understand real-world actions.

Against this backdrop, Google has launched a new video dataset it hopes will be used to "accelerate research" into computer vision applications that involve recognizing actions within videos. AVA, an acronym for "atomic visual actions," is a dataset made up of multiple labels for people doing things in video sequences.

The challenge of identifying actions in videos is compounded in complex scenes where multiple actions are combined and carried out by different people.

"Teaching machines to understand human actions in videos is a fundamental research problem in Computer Vision, essential to applications such as personal video search and discovery, sports analysis, and gesture interfaces," explained Google software engineers Chunhui Gu and David Ross in a blog post. "Despite exciting breakthroughs made over the past years in classifying and finding objects in images, recognizing human actions still remains a big challenge."

AVA is essentially a collection of YouTube URLs annotated with a set of 80 atomic actions. The annotations span nearly 58,000 video segments and cover everyday activities such as shaking hands, kicking, hugging, kissing, drinking, playing instruments, and walking.
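To make the idea of per-person action labels more concrete, here is a minimal Python sketch of how such annotations might be represented and grouped. It is not AVA's actual file format: the `ActionLabel` fields, the `actions_per_person` helper, and the example.com URLs are all hypothetical, chosen only to illustrate several people in one video segment each carrying their own set of actions.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class ActionLabel:
    """One hypothetical annotation: a person doing one action in one segment."""
    video_url: str        # placeholder URL, not a real dataset entry
    segment_start: float  # start of the video segment, in seconds
    person_id: int        # distinguishes people within the same segment
    action: str           # e.g. "walking" or "shaking hands"


def actions_per_person(labels):
    """Group action names by (video, segment, person)."""
    grouped = defaultdict(set)
    for label in labels:
        key = (label.video_url, label.segment_start, label.person_id)
        grouped[key].add(label.action)
    return dict(grouped)


# Illustrative data only: two people in the same segment, one doing two things.
SAMPLE_LABELS = [
    ActionLabel("https://example.com/video-a", 0.0, 1, "walking"),
    ActionLabel("https://example.com/video-a", 0.0, 1, "drinking"),
    ActionLabel("https://example.com/video-a", 0.0, 2, "shaking hands"),
]

if __name__ == "__main__":
    for key, actions in actions_per_person(SAMPLE_LABELS).items():
        print(key, sorted(actions))
```

Keying on the video, the segment, and the person is one simple way to capture the point that a single scene can contain multiple actions carried out by different people.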

By allowing anyone to access the dataset, Google is hoping to improve machines' "social visual intelligence" so they can understand what humans are doing and anticipate what they may do next.

"We hope that the release of AVA will help improve the development of human action recognition systems, and provide opportunities to model complex activities based on labels with fine spatio-temporal granularity at the level of individual person’s actions," the company said.
