Google researchers use multiple cameras to reduce error rates for robot insertion and stacking tasks

Object-manipulating robots rely on cameras to make sense of the world around them, but these cameras often require careful installation and ongoing calibration and maintenance. A new study published by researchers at Google's Robotics division and Columbia University proposes a technique that learns to accomplish tasks using multiple color cameras without an explicit 3D representation. They say that it outperforms baselines on difficult stacking and insertion tasks.

This latest work builds on Google's vast body of robotics research. Last October, scientists at the company published a paper detailing a machine learning system dubbed Form2Fit, which aims to teach a picker robot with a suction arm the concept of assembling objects into kits. Google Brain researchers are pursuing a novel robot task planning technique involving deep dynamics models, or DDM, that they claim enables mechanical hands to manipulate multiple objects. And more recently, a Google team took the wraps off of ClearGrasp, an AI model that helps robots better recognize transparent objects.

As the researchers point out, until recently, most automated solutions were designed for rigid settings where scripted robot actions are repeated to move through a predefined set of positions. This approach calls for a highly calibrated setup that can be expensive and time-consuming, and one that lacks the robustness needed to handle changes in the environment. Advancements in computer vision have led to better performance in grasping, but tasks like stacking, insertion, and precision kitting remain challenging. That's because they require accurate 3D geometric knowledge of the task environment including object shape and pose, relative distances and orientation between locations, and other factors.

By contrast, the team's method leverages a multi-camera view and a reinforcement learning framework that takes in images from different viewpoints and produces robot actions in a closed-loop fashion. By combining and learning directly from the camera views without an intermediary reconstruction step, they say it's able to improve state estimation while at the same time increasing the robustness of the system's actions.
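
To make the closed-loop idea concrete, here is a minimal Python sketch of a policy that encodes each camera view separately, concatenates the features, and emits a new action at every step from fresh images. It is an illustration only and not the paper's architecture or training method: the image size, feature size, action layout, random untrained weights, and every name in it (MultiViewPolicy, fake_camera_views, run_episode) are assumptions made for this example. Only the three-camera count comes from the experiments described in this article.

```python
import math
import random

NUM_CAMERAS = 3        # matches the three cameras in the simulated setup
PIXELS_PER_VIEW = 48   # hypothetical: a 4x4 RGB image, flattened
FEATURE_DIM = 4        # hypothetical per-view feature size
ACTION_DIM = 4         # hypothetical action: dx, dy, dz, gripper


def random_matrix(rng, rows, cols, scale):
    """Build a rows x cols matrix of Gaussian weights (untrained placeholder)."""
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


def project(vector, matrix):
    """Multiply a row vector by a matrix and squash each output with tanh."""
    return [
        math.tanh(sum(v * row[j] for v, row in zip(vector, matrix)))
        for j in range(len(matrix[0]))
    ]


class MultiViewPolicy:
    """Hypothetical policy: shared per-view encoder, then a fused action head."""

    def __init__(self, seed=0):
        rng = random.Random(seed)
        self.encoder = random_matrix(rng, PIXELS_PER_VIEW, FEATURE_DIM, 0.1)
        self.head = random_matrix(rng, NUM_CAMERAS * FEATURE_DIM, ACTION_DIM, 0.5)

    def act(self, views):
        if len(views) != NUM_CAMERAS:
            raise ValueError("expected one image per camera")
        fused = []
        for view in views:
            # Each view is encoded directly; no 3D reconstruction step.
            fused.extend(project(view, self.encoder))
        return project(fused, self.head)


def fake_camera_views(rng):
    """Stand-in for rendering; a real simulator would return actual images."""
    return [[rng.random() for _ in range(PIXELS_PER_VIEW)] for _ in range(NUM_CAMERAS)]


def run_episode(policy, steps=5, seed=1):
    """Closed loop: observe all cameras, act, then observe again."""
    rng = random.Random(seed)
    actions = []
    for _ in range(steps):
        views = fake_camera_views(rng)
        # A real environment would apply this action before the next observation.
        actions.append(policy.act(views))
    return actions
```

Calling run_episode(MultiViewPolicy()) returns one action per step. In a real system the encoder and head would be learned with reinforcement learning rather than drawn at random, and the images would change in response to the robot's actions.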

In experiments, the researchers deployed their setup in a simulated environment containing a Kuka arm equipped with a gripper, two bins placed in front of the robot, and three cameras mounted to overlook those bins. The arm was first given a stacking task that began with a single block, either blue or orange in color, placed at a random position in one of the bins. In other tasks, it had to insert a block firmly into a middle fixture and stack blocks one on top of the other.

The researchers ran 180 data collection jobs across 10 graphics cards to train their reinforcement learning model, with each producing roughly 5,000 episodes per hour for the insertion tasks. They report "large reductions" in error rates on precision-based tasks: 49.18% on the first stacking task, 56.84% on the second stacking task, and 64.1% on the insertion task. "The effective use of multiple views enables a richer observation of the underlying state relevant to the task," wrote the paper's coauthors. "Our multi-view approach enables 3D tasks from RGB cameras without the need for explicit 3D representations and without camera-camera and camera-robot calibration. In the future, similar multi-view benefits can be achieved with a single mobile camera by learning a camera placement policy in addition to the task policy."
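
For readers unsure how an error-rate reduction is computed, the short Python sketch below shows one common reading: the drop in error rate relative to the baseline. Whether the paper's percentages are relative in this sense is an assumption here, since the article does not say. The function name and the input numbers are hypothetical and are not taken from the study.

```python
def relative_error_reduction(baseline_error_pct, new_error_pct):
    """Percent by which the error rate falls relative to the baseline."""
    if baseline_error_pct <= 0:
        raise ValueError("baseline error rate must be positive")
    return 100.0 * (baseline_error_pct - new_error_pct) / baseline_error_pct


# Hypothetical numbers: a baseline that fails 20% of the time versus a
# method that fails 10% of the time.
reduction = relative_error_reduction(20.0, 10.0)
print(reduction)  # 50.0
```

Under this reading, a 50% reduction means the failure rate was halved, not that the success rate rose by 50 percentage points.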
