Google is teaching AI systems to think more like children, at least when it comes to object recognition and perception. In a paper (“Grasp2Vec: Learning Object Representations from Self-Supervised Grasping”) and accompanying blog post, Eric Jang, a software engineer at Google’s robotics division, and Coline Devin, a Ph.D. student at Berkeley and former research intern, describe an algorithm, Grasp2Vec, that “learns” the characteristics of objects by observing and manipulating them.
Their work comes a few months after MIT researchers demonstrated a computer vision system, dubbed Dense Object Nets (DON for short), that allows robots to inspect, visually understand, and manipulate objects they’ve never seen before. The Google work is based on cognitive developmental research on self-supervision, the researchers explained.
Time-tested studies on object permanence have shown that people derive knowledge about the world by interacting with their environment and, over time, learn from the outcomes of the actions they take. Even grasping an object provides a lot of information about it: for example, the fact that it had to be within reach in the moments leading up to the grasp.
“In robotics, this type of … learning is actively researched because it enables robotic systems to learn without the need for large amounts of training data or manual supervision,” Jang and Devin wrote. “By using this form of self-supervision, [machines like] robots can learn to recognize … object[s] by … visual change[s] in the scene.”
The team collaborated with X Robotics to “teach” a robotic arm to grasp objects “unintentionally” and, in the course of training, learn representations of various objects. Those representations eventually led to “intentional grasping” of tools and toys chosen by the researchers.
The team leveraged reinforcement learning — an AI training technique that uses a system of rewards to drive agents toward specific goals — to encourage the arm to grasp objects, inspect them with its camera, and answer basic object recognition questions (“Do these objects match?”). And they implemented a perception system that could extract meaningful information about the items by analyzing a series of three images: an image before grasping, an image after grasping, and an isolated view of the grasped object.
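The article does not say how the three images are compared. The idea described in the Grasp2Vec paper is that the embedding of the scene before the grasp, minus the embedding of the scene after it, should roughly equal the embedding of the grasped object. The Python sketch below illustrates that arithmetic under heavy assumptions: the embeddings are hand-made toy vectors rather than outputs of a trained network, and the function names and the 0.9 threshold are hypothetical, not taken from the researchers' code.

```python
import math

def subtract(a, b):
    """Element-wise difference of two equal-length embedding vectors."""
    return [x - y for x, y in zip(a, b)]

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def objects_match(scene_before, scene_after, object_view, threshold=0.9):
    """Answer "Do these objects match?" for one grasp.

    The difference between the before and after scene embeddings stands in
    for "whatever was removed from the scene". The threshold is an
    illustrative assumption, not a value from the paper.
    """
    removed = subtract(scene_before, scene_after)
    return cosine_similarity(removed, object_view) >= threshold

# Toy embeddings with made-up numbers: each dimension loosely stands for
# one object sitting in the bin.
scene_before = [1.0, 1.0, 1.0]   # three objects present
scene_after = [1.0, 0.0, 1.0]    # the second object has been lifted out
grasped_view = [0.0, 0.9, 0.1]   # isolated view of the object in the gripper
other_view = [0.9, 0.0, 0.1]     # isolated view of a different object

print(objects_match(scene_before, scene_after, grasped_view))
print(objects_match(scene_before, scene_after, other_view))
```

In a real system the three vectors would come from learned image encoders, and the match signal could serve as the reward for intentional grasping; the sketch only shows why a before-and-after pair carries information about the object that was picked up.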
In tests, Grasp2Vec and the researchers’ novel policies achieved a success rate of 80 percent, and worked even in cases where multiple objects matched the target and where the target consisted of multiple objects.
“We show how robotic grasping skills can generate the data used for learning object-centric representations,” they wrote. “We then can use representation learning to ‘bootstrap’ more complex skills like instance grasping, all while retaining the self-supervised learning properties of our autonomous grasping system. Going forward, we are excited not only for what machine learning can bring to robotics by way of better perception and control, but also what robotics can bring to machine learning in new paradigms of self-supervision.”