Google taps computer vision to improve robot manipulation performance

In a preprint paper, a team from Google and MIT investigates whether pretrained visual representations can be used to improve a robot's object manipulation performance. They say their proposed technique -- affordance-based manipulation -- can enable robots to learn to pick and grasp objects in less than 10 minutes of trial and error, which could lay the groundwork for highly adaptable warehouse robots.

Affordance-based manipulation is a way to reframe a manipulation task as a computer vision task. Rather than mapping pixels to object labels, it associates pixels with the value of actions. Since the structures of computer vision models and affordance models are relatively similar, transfer learning techniques from computer vision can be applied to help affordance models learn faster with less data -- or so the thinking goes.
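To make the pixel-to-action-value idea concrete, here is a minimal sketch, not the paper's implementation. It assumes the model outputs a 2D map with one grasp-value score per pixel and that the robot simply attempts a grasp at the highest-scoring pixel. The function name `best_grasp_pixel` and the toy values are hypothetical.

```python
import numpy as np


def best_grasp_pixel(affordance_map):
    """Return ((row, col), value) for the highest-valued pixel of a 2D map.

    Hypothetical helper: each entry is assumed to be the predicted value
    of attempting a grasp centered on that pixel.
    """
    affordance_map = np.asarray(affordance_map, dtype=float)
    if affordance_map.ndim != 2:
        raise ValueError("expected a 2D (height, width) affordance map")
    flat_index = int(np.argmax(affordance_map))
    row, col = np.unravel_index(flat_index, affordance_map.shape)
    return (int(row), int(col)), float(affordance_map[row, col])


# Toy 3x4 map standing in for a model's per-pixel output (made-up values).
toy_map = np.array([
    [0.05, 0.10, 0.02, 0.01],
    [0.20, 0.85, 0.30, 0.03],
    [0.04, 0.15, 0.07, 0.02],
])
pixel, value = best_grasp_pixel(toy_map)
print(pixel, value)
```

A classifier would emit one label per image or per pixel; here the same kind of dense output is read as "how good is acting at this location," which is why vision architectures carry over so directly.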

To test this, the team injected the "backbones" -- i.e., the weights (or variables) responsible for early-stage image processing, like filtering edges, detecting corners, and distinguishing between colors -- of various popular computer vision models pretrained on vision tasks into affordance-based manipulation models. They then tasked a real-world robot with learning to grasp a set of objects through trial and error.

Initially, there weren't significant performance gains compared with training the affordance models from scratch. However, upon transferring weights from both the backbone and the head (which consists of weights used in later-stage processing, such as recognizing contextual cues and executing spatial reasoning) of a pretrained vision model, there was a substantial improvement in training speed. Grasping success rates reached 73% in just 500 trial-and-error grasp attempts and jumped to 86% by 1,000 attempts. On new objects unseen during training, models with pretrained weights also generalized better, reaching grasping success rates of 83% with the backbone alone and 90% with both the backbone and head.
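The difference between the two transfer settings can be sketched with plain parameter dictionaries. This is an illustrative assumption about how such a transfer might be organized, not the team's code: the parameter names (`backbone/conv1`, `head/conv`), the shapes, and the helper `transfer_weights` are all hypothetical, and random arrays stand in for trained weights.

```python
import numpy as np


def init_params(rng, shapes):
    """Randomly initialise one array per named parameter."""
    return {name: rng.standard_normal(shape) for name, shape in shapes.items()}


def transfer_weights(target, source, prefixes):
    """Copy source parameters into a copy of target.

    Only parameters whose name starts with one of `prefixes` and whose
    shape matches are copied; everything else keeps its target value.
    Returns the new parameter dict and the sorted list of copied names.
    """
    result = {name: value.copy() for name, value in target.items()}
    copied = []
    for name, value in source.items():
        matches = name.startswith(tuple(prefixes))
        if matches and name in result and result[name].shape == value.shape:
            result[name] = value.copy()
            copied.append(name)
    return result, sorted(copied)


# Hypothetical layout shared by the vision model and the affordance model.
SHAPES = {"backbone/conv1": (3, 3), "backbone/conv2": (3, 3), "head/conv": (3, 1)}

rng = np.random.default_rng(0)
vision_model = init_params(rng, SHAPES)      # stand-in for pretrained weights
affordance_model = init_params(rng, SHAPES)  # stand-in for a fresh model

backbone_only, copied_a = transfer_weights(
    affordance_model, vision_model, ["backbone/"])
backbone_and_head, copied_b = transfer_weights(
    affordance_model, vision_model, ["backbone/", "head/"])
print(copied_a)
print(copied_b)
```

In the backbone-only setting the head still starts from its fresh initialization; in the second setting every layer starts from the vision model's values, which is the configuration the article reports as training faster.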

According to the team, reusing weights from vision tasks that require object localization (e.g., instance segmentation) significantly improved the exploration process when learning manipulation tasks. Pretrained weights from these tasks encouraged the robot to sample actions on things that look more like objects, thereby quickly generating a more balanced data set from which the system could learn the differences between good and bad grasps.
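The exploration effect can be illustrated with a toy simulation. The sketch below assumes, purely for illustration, that grasp locations are sampled from a softmax over per-pixel scores; the paper's actual exploration scheme may differ. The scene, the scores, and the helper names are all made up.

```python
import numpy as np


def sample_grasp_pixels(scores, num_samples, rng, temperature=1.0):
    """Sample (row, col) pixels with probability softmax(scores / temperature)."""
    scores = np.asarray(scores, dtype=float)
    logits = scores.ravel() / temperature
    logits -= logits.max()  # subtract the max for numerical stability
    probs = np.exp(logits)
    probs /= probs.sum()
    flat = rng.choice(scores.size, size=num_samples, p=probs)
    rows, cols = np.unravel_index(flat, scores.shape)
    return np.stack([rows, cols], axis=1)


# Toy scene: an 8x8 image containing one 2x2 object (4 of 64 pixels).
object_mask = np.zeros((8, 8), dtype=bool)
object_mask[3:5, 3:5] = True


def fraction_on_object(pixels):
    """Share of sampled pixels that land on the object."""
    return float(object_mask[pixels[:, 0], pixels[:, 1]].mean())


rng = np.random.default_rng(0)
uninformed_scores = np.zeros((8, 8))                   # no preference anywhere
object_aware_scores = np.where(object_mask, 4.0, 0.0)  # higher on the object

uniform_hits = fraction_on_object(
    sample_grasp_pixels(uninformed_scores, 2000, rng))
aware_hits = fraction_on_object(
    sample_grasp_pixels(object_aware_scores, 2000, rng))
print(uniform_hits, aware_hits)
```

With uniform scores, only about 6% of attempts (4 of 64 pixels) are expected to land on the object, so almost every sample is a failed grasp on empty space. With the object-aware scores the expected share rises to roughly 78%, giving a mix of successes and failures on real objects, which is the more balanced data set the article describes.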

"Many of the methods that we use today for end-to-end robot learning are effectively the same as those being used for computer vision tasks," wrote the study's coauthors. "Our work here on visual pretraining illuminates this connection and demonstrates that it is possible to leverage techniques from visual pretraining to improve the learning efficiency of affordance-based manipulation applied to robotic grasping tasks. While our experiments point to a better understanding of deep learning for robots, there are still many interesting questions that have yet to be explored. For example, how do we leverage large-scale pretraining for additional modes of sensing (e.g. force-torque or tactile)? How do we extend these pretraining techniques toward more complex manipulation tasks that may not be as object-centric as grasping? These areas are promising directions for future research."
