Google's Transporter Network uses minimal data to teach robots to stack blocks

Researchers at Google say they've developed an AI model architecture -- Transporter Network -- that enables object-grasping robots to reason about which visual cues are important and how they should be rearranged in a scene. During experiments, the researchers say their Transporter Networks achieved "superior" efficiency on a number of tasks including stacking a pyramid of blocks, assembling kits, manipulating ropes, and pushing piles of small objects.

Robot grasping is a challenge. For example, robots struggle to perform what's called "mechanical search," which is when they have to identify and pick up an object from within a pile of other objects. Most robots aren't especially adaptable, and there's a lack of sufficiently capable AI models for guiding robot pincers in mechanical search -- a problem that's come to the fore as the pandemic causes companies to consider adopting automation.

The Google study coauthors say Transporter Networks don't require any prior 3D model, pose, or class category knowledge of the objects to be manipulated, instead relying only on information contained within partial depth camera data. They're also capable of generalizing to new objects and configurations and, for some tasks, learning from a single demonstration. In fact, on 10 unique tabletop manipulation tasks, Transporter Networks trained from scratch ostensibly attained over 90% success on most tasks with objects in new configurations using 100 expert video demonstrations of the tasks.

The researchers trained Transporter Networks on datasets ranging from one to 1,000 demonstrations per task. They first deployed them on Ravens, a simulated benchmark learning environment consisting of a Universal Robots UR5e arm with a suction gripper overlooking a 0.5 x 1 meter workspace. Then they validated the Transporter Networks on kit assembly tasks using real UR5e robots with suction grippers and cameras including an Azure Kinect.

Because of pandemic-related lockdowns, the researchers performed their experiments using a Unity-based program that enables people to remotely teleoperate robots. For one experiment, the teleoperators were tasked with repeatedly assembling and disassembling a kit of five small bottled mouthwashes or nine uniquely shaped wooden toys, using either a virtual reality headset or a mouse and keyboard to label picking and placing poses. The Transporter Networks, which were trained on a total of 11,633 pick-and-place actions collected across all tasks from 13 human operators, achieved 98.9% success in assembling kits of bottled mouthwashes.

"In this work, we presented the Transporter Network, a simple model architecture that infers spatial displacements, which can parameterize robot actions from visual input," the researchers wrote. "It makes no assumptions of objectness, exploits spatial symmetries, and is orders of magnitude more sample efficient in learning vision-based manipulation tasks than end-to-end alternatives ... In terms of its current limitations: it is sensitive to camera-robot calibration, and it remains unclear how to integrate torque and force actions with spatial action spaces. Overall, we are excited about this direction and plan to extend it to real-time high-rate control, and as well as tasks involving tool use."
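To make the idea of "inferring spatial displacements" concrete, here is a toy sketch in Python. It is not the researchers' code, and the article does not spell out the mechanism. The sketch assumes that a placement can be found by taking a small patch of features around the pick point and sliding it over a second feature map of the scene to find the best match. The function name `infer_place`, the hand-built `query` and `key` maps, and the patch size are all hypothetical. A real system would learn these feature maps from demonstrations and would also consider rotations, which this sketch leaves out.

```python
import numpy as np


def infer_place(query, key, pick, half=1):
    """Slide the patch of `query` centered on `pick` over `key`
    and return the (row, col) with the highest correlation score."""
    r, c = pick
    kernel = query[r - half:r + half + 1, c - half:c + half + 1]
    h, w = key.shape
    scores = np.full((h, w), -np.inf)  # border cells stay at -inf
    for i in range(half, h - half):
        for j in range(half, w - half):
            window = key[i - half:i + half + 1, j - half:j + half + 1]
            scores[i, j] = float(np.sum(window * kernel))
    best = np.unravel_index(np.argmax(scores), scores.shape)
    return int(best[0]), int(best[1])


# Hypothetical feature maps: a plus-shaped object at (2, 2) in `query`
# and a matching plus-shaped slot at (5, 6) in `key`.
query = np.zeros((8, 8))
query[2, 1:4] = 1.0
query[1:4, 2] = 1.0

key = np.zeros((8, 8))
key[5, 5:8] = 1.0
key[4:7, 6] = 1.0

pick = (2, 2)
place = infer_place(query, key, pick)
displacement = (place[0] - pick[0], place[1] - pick[1])
print("place:", place, "displacement:", displacement)
```

The point of the sketch is that the action is read off as a location in the image rather than from any object model, which is one plausible reading of the "no assumptions of objectness" claim in the quote.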

The coauthors say they plan to release code and open-source Ravens (and an associated API) in the near future.
