Picker robots (that is, motorized pincers that pick up and place things) might have repeatability in their favor, but complex poses and unfamiliar objects remain a challenge for most of them. It’s no wonder: They not only have to locate objects and understand how to grasp them, which requires an enormous amount of training data, but they also have to set them down without damaging them or disturbing their surroundings.

Leave it to the folks at MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL), though, to pioneer an approach that overcomes those barriers. In a newly published research paper (“Category-Level Robotic Manipulation with K-PAM: Key-Point Affordance Manipulation”), they describe a system called Keypoint Affordance Manipulation, or “kPAM” for short, that detects a collection of target coordinates called keypoints, enabling the robotic hardware on which it’s deployed to handle a range of objects with finesse.

“Whenever you see a robot video on YouTube, you should watch carefully for what the robot is not doing,” MIT professor and senior author Russ Tedrake said in a statement. “Robots can pick almost anything up, but if it’s an object they haven’t seen before, they can’t actually put it down in any meaningful way.”

Most pick-and-place perception and grasping algorithms estimate positions, orientations, and geometries as opposed to points, which translates poorly to tasks involving oddly shaped objects. By contrast, kPAM’s three-dimensional keypoints pipeline can “naturally” accommodate variation among object types. Tedrake, along with PhD students Lucas Manuelli, Pete Florence, and Wei Gao, says that only three coordinates are required for a relatively uniform target like a coffee mug (importantly, these include one at the bottom center and one at the top center), and that as few as six are sufficient for things like slippers, boots, and high heels.
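
To make the idea concrete, here is a minimal sketch of how two of a mug’s keypoints could define an upright placement. It is an illustration under stated assumptions, not the researchers’ method: the function names (`rotation_aligning`, `upright_placement`, `apply`) are hypothetical, the coordinates are made up, and the closed-form alignment below stands in for whatever solver the paper actually uses.

```python
import math

def cross(a, b):
    return [a[1]*b[2] - a[2]*b[1], a[2]*b[0] - a[0]*b[2], a[0]*b[1] - a[1]*b[0]]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def unit(a):
    n = math.sqrt(dot(a, a))
    return [x / n for x in a]

def rotation_aligning(u, v):
    """3x3 rotation (nested lists) turning direction u onto direction v (Rodrigues' formula)."""
    u, v = unit(u), unit(v)
    axis = cross(u, v)
    s, c = math.sqrt(dot(axis, axis)), dot(u, v)  # sine and cosine of the angle
    if s < 1e-9:
        # u and v are parallel or opposite: any axis perpendicular to u works.
        helper = [1.0, 0.0, 0.0] if abs(u[0]) < 0.9 else [0.0, 1.0, 0.0]
        axis, s = cross(u, helper), 0.0
    k = unit(axis)
    K = [[0.0, -k[2], k[1]], [k[2], 0.0, -k[0]], [-k[1], k[0], 0.0]]
    K2 = [[sum(K[i][m] * K[m][j] for m in range(3)) for j in range(3)] for i in range(3)]
    return [[(1.0 if i == j else 0.0) + s * K[i][j] + (1.0 - c) * K2[i][j]
             for j in range(3)] for i in range(3)]

def apply(R, t, p):
    """Apply the rigid transform p -> R p + t."""
    return [dot(R[i], p) + t[i] for i in range(3)]

def upright_placement(bottom, top, target_bottom):
    """Rigid transform that stands the mug upright with its bottom-center keypoint on the target.

    Hypothetical helper: assumes the world z-axis points up and ignores the handle keypoint.
    """
    mug_axis = [b - a for a, b in zip(bottom, top)]
    R = rotation_aligning(mug_axis, [0.0, 0.0, 1.0])
    rotated_bottom = [dot(R[i], bottom) for i in range(3)]
    t = [g - r for g, r in zip(target_bottom, rotated_bottom)]
    return R, t

# Made-up example: a mug lying on its side, to be stood up on a shelf spot.
bottom, top = [0.5, 0.0, 0.04], [0.6, 0.0, 0.04]
R, t = upright_placement(bottom, top, [0.0, 0.3, 0.1])
placed_top = apply(R, t, top)  # ends up directly above the target point
```

Because the goal is written purely in terms of keypoints, the same few lines would apply to a tall mug or a squat one, which is the flexibility the researchers are describing.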

“Understanding just a little bit more about the object — the location of a few key points — is enough to enable a wide range of useful manipulation tasks,” Tedrake said.

The researchers tap a “state-of-the-art” integral AI model for keypoint detection, which takes a single RGB and depth image as input and outputs a probability heatmap and depth prediction map for each coordinate. (The 2D image coordinates, depth values, and final 3D keypoints are recovered in subsequent steps.) They collect training data from scenes containing objects of interest by projecting keypoint meshes into the image space, given an estimated camera pose from 3D reconstruction algorithms.
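
For readers curious how a heatmap and a depth map might turn into a 3D point, the sketch below shows one plausible recovery step. It assumes the 2D location is taken as the probability-weighted average of pixel coordinates, the depth as the same weighted average over the depth map, and the 3D point as a standard pinhole back-projection. The function names and the camera intrinsics (`fx`, `fy`, `cx`, `cy`) are hypothetical, and the tiny 3x3 maps are made up for illustration.

```python
def expected_pixel_and_depth(heatmap, depth_map):
    """Probability-weighted average of pixel coordinates and depth.

    heatmap and depth_map are equally sized nested lists (rows of values);
    the heatmap is normalized here so its weights sum to 1.
    """
    total = sum(sum(row) for row in heatmap)
    u = v = z = 0.0
    for r, (h_row, d_row) in enumerate(zip(heatmap, depth_map)):
        for c, (p, d) in enumerate(zip(h_row, d_row)):
            w = p / total
            u += w * c   # column index -> horizontal pixel coordinate
            v += w * r   # row index -> vertical pixel coordinate
            z += w * d   # depth in meters (assumed unit)
    return u, v, z

def backproject(u, v, z, fx, fy, cx, cy):
    """Pinhole camera model: pixel (u, v) at depth z -> 3D point in the camera frame."""
    return ((u - cx) * z / fx, (v - cy) * z / fy, z)

# Made-up example: probability mass split between two neighboring pixels.
heatmap = [[0.0, 0.0, 0.0],
           [0.0, 0.5, 0.5],
           [0.0, 0.0, 0.0]]
depth_map = [[2.0] * 3 for _ in range(3)]
u, v, z = expected_pixel_and_depth(heatmap, depth_map)
keypoint_3d = backproject(u, v, z, fx=100.0, fy=100.0, cx=1.5, cy=1.0)
```

Averaging over the whole heatmap, rather than picking the single brightest pixel, yields sub-pixel coordinates, which matters when a few centimeters decide whether a mug lands on its hook.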

In experiments involving a KUKA LBR iiwa robot fitted with a Schunk WSG 50 gripper (and a depth-sensing PrimeSense sensor), the fully trained system successfully guided the robot arm in placing shoes on a shoe rack, setting mugs upright on a shelf, and hanging mugs on a rack by their handles.

The bot had little trouble with the test set of 20 shoes; in 100 trials, it failed only twice to place a shoe on the rack. Errors arose when the gripper grasped the shoe by the heel, causing the shoe to shift from its original position.

In the mug-on-shelf task, which involved a test set of 40 mugs varying in shape, size, and visual appearance, the robot managed to grip every mug placed upright but only 19 of those lying horizontally, owing to the gripper’s limited stroke. Impressively, it placed the mugs on the shelf within five centimeters of the target location in all but two trials (in which the mug was placed upside down).

The mug handle test had a smaller set of 30 mugs, five of which were very small, with handles measuring less than two centimeters in height and width. The gripper hung the larger mugs on the rack 100 percent of the time, but with the smaller mugs, it achieved only a 50 percent success rate. The researchers chalk up the failures to inaccurate keypoint detections.

There’s room for improvement in other areas, too. Tedrake and coauthors note that humans must annotate the training data their system requires, a process they intend to phase out in future work by supplementing real-world samples with synthetic data. And they concede that keypoints have to be relabeled and the model retrained even if the object category doesn’t change.

Still, they maintain that kPAM affords greater flexibility than most current methods, and they believe that someday it might help robots undertake tasks like unloading dishwashers, wiping down kitchen counters, and performing pick-and-place jobs in factories and other industrial environments.