M12-backed TwentyBN raises $10 million to help AI interpret human behavior

Robots are a staple of industry. The International Federation of Robotics predicts that 1.7 million of them will find a home on factory floors worldwide by 2020. But most lack an intuitive understanding of human behavior. In order to work safely and effectively alongside human employees, they have to be painstakingly taught rules that account for every potential scenario they might encounter.

Twenty Billion Neurons (TwentyBN), a three-year-old startup with offices in Berlin and Canada, believes there's a better way. Today, it announced a $10 million funding round led by M12, Microsoft's venture fund, with participation from Coparion, Creative Edge, and MFV Partners.

CEO and chief scientist Roland Memisevic said the company will use the new capital to scale its business.

“From day one, we’ve been committed to pushing decades of progress in AI and interactive computer vision into every corner of the world, be it the home, office, store, or a robot’s brain,” he told VentureBeat.

TwentyBN's novel computer vision systems can interact with humans while observing them using nothing but an off-the-shelf RGB camera. Its artificial intelligence (AI) not only responds to basic behaviors, but takes into account the surroundings and context of each engagement, providing human-like awareness of situations.

Memisevic contends that while AI image classification systems are well suited to detect objects, they don't come close to human-level autonomy. The key to true cognitive understanding lies in the ability to make sense of actions, he said.

At the core of TwentyBN's technology is a crowdsourced database of video clips that TwentyBN claims is the largest of its kind. Over the course of years, it's procured roughly 2 million clips from a network of volunteers who've acted out hundreds of thousands of scenes, a sampling of which it offers for free.

Its "Something Something" dataset comprises people performing basic actions with everyday objects, and its Jester dataset shows humans performing predefined hand gestures in front of a webcam. Researchers at the Massachusetts Institute of Technology recently sourced both to train an AI model that could predict how objects in videos will be changed or transformed by humans -- a sheet of paper being ripped into pieces, for example.

Sophisticated machine learning models trained on the datasets enable touchless, gesture-based interfaces for automotive, smart home, and retail applications. One model -- SuperModel -- detects body motions and human-object interactions. Two others, AirMouse and Gesture Recognition, collectively recognize over 30 dynamic hand motions in real time and track finger movements in the air.

Clients leverage the models through a software development kit that's compatible with a variety of platforms, including Docker, RIS, Vuforia, and Wikitude. AirMouse and Gesture Recognition also run on a wide range of hardware, including embedded systems, desktops, and mobile devices.

“TwentyBN is advancing how deep learning-based models recognize both the nouns and verbs describing a visual scene,” said M12 managing director Samir Kumar. “New intelligent camera experiences become possible when these models derived from their crowd-acting datasets are deployed to run efficiently on IoT edge devices.”

Memisevic, a CIFAR fellow from Germany and a graduate of the University of Toronto, was advised by AI luminary Geoffrey Hinton before going on to teach machine learning at the University of Montreal. In 2015, he founded TwentyBN with two college friends, Ingo Bax and Christian Thurau.
