Nvidia researchers use AI to teach robots how to take objects from humans

In a preprint research paper published this week, Nvidia researchers propose an approach for human-to-robot handovers in which the robot meets the human halfway, classifies the human's grasp, and plans a trajectory to take the object from the human's hand. They claim it results in more fluent handovers compared with baselines, and they say it could inform the design of collaborative warehouse robots that bolster workers' productivity.

As the coauthors explain, a growing body of research focuses on the problem of enabling seamless human-robot handovers. Most studies tackle the challenge of object transfer from the robot to the human, assuming that for the reverse the human can simply place the object in the robot's gripper. But the accuracy of human and object pose estimation is affected by occlusion -- i.e., when the object and hand block each other from the camera's view -- and the human often needs to pay attention to another task while transferring the object.

The Nvidia team discretized the ways in which humans can hold small objects into several categories, so that if a hand was grasping a block, the pose could be categorized as "on-open-palm," "pinch-bottom," "pinch-top," "pinch-side," or "lifting." Then they used a Microsoft Azure Kinect depth camera to compile a data set for training an AI model to classify a hand holding an object into one of those categories. Specifically, they showed each subject an example image of a hand grasp and recorded the subject performing similar poses for 20 to 60 seconds. During the recording, the person could move his or her body and hand to different positions to diversify the camera viewpoints. Both the left and right hands of each subject were captured, for a total of 151,551 images.

The researchers modeled the handover task as what they call a "robust logical-dynamical system," which generates motion plans that avoid contact between the gripper and the hand given a certain classification. The system has to adapt to different possible grasps and reactively choose the way to approach the human and take the object from them. Until it gets a stable estimate of how the human wants to present the block, it stays in a "home" position and waits.
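The "wait at home until the estimate is stable" behavior can be illustrated with a small sketch. This is not the researchers' logical-dynamical system, and the article does not say how stability is measured. The sliding-window majority vote, the class name `StableGraspFilter`, the parameters `window` and `min_agreement`, and the action strings are all assumptions made for illustration. Only the five grasp labels come from the article.

```python
from collections import Counter, deque
from typing import Deque, Optional

# The five grasp categories named in the article.
GRASPS = ("on-open-palm", "pinch-bottom", "pinch-top", "pinch-side", "lifting")


class StableGraspFilter:
    """Hypothetical helper: report a grasp only once recent predictions agree."""

    def __init__(self, window: int = 10, min_agreement: float = 0.8) -> None:
        # Both thresholds are illustrative assumptions, not values from the paper.
        self.window = window
        self.min_agreement = min_agreement
        self.history: Deque[str] = deque(maxlen=window)

    def update(self, label: str) -> Optional[str]:
        """Add one per-frame classification; return a label only if it is stable."""
        if label not in GRASPS:
            raise ValueError(f"unknown grasp label: {label!r}")
        self.history.append(label)
        if len(self.history) < self.window:
            return None  # not enough frames observed yet
        top, count = Counter(self.history).most_common(1)[0]
        return top if count / self.window >= self.min_agreement else None


def next_action(stable_label: Optional[str]) -> str:
    """Stay at the home position until a stable grasp estimate exists."""
    if stable_label is None:
        return "stay-home"
    return f"approach:{stable_label}"
```

Feeding the filter one classification per camera frame and passing its result to `next_action` keeps the hypothetical robot at home while predictions flicker between categories, and releases it only once one category dominates the recent window.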

In a series of experiments, the researchers performed a systematic evaluation across a range of hand positions and grasps, testing both the classification model and the task model. Two Panda robots from Franka Emika were mounted on identical tables in different locations, and human users handed them four differently colored blocks.

According to the coauthors, their method "consistently" improved grasp success rate and reduced both total execution time and trial duration compared with existing approaches. It had a grasp success rate of 100% versus the next-best technique's 80%, and a planning success rate of 64.3% compared with 29.6%. Moreover, it took 17.34 seconds to plan and execute actions versus the 20.93 seconds the second-fastest system took.
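To put those figures in relative terms, the arithmetic can be sketched as follows. The function names are hypothetical, and the derived values are computed here from the numbers quoted above rather than reported by the researchers.

```python
def relative_reduction(baseline: float, value: float) -> float:
    """Percentage by which `value` is lower than `baseline`."""
    return (baseline - value) / baseline * 100.0


def point_gap(ours: float, other: float) -> float:
    """Difference between two percentages, in percentage points."""
    return ours - other


# Figures quoted in the article.
time_ours, time_next = 17.34, 20.93   # seconds to plan and execute
grasp_ours, grasp_next = 100.0, 80.0  # grasp success rate, %
plan_ours, plan_next = 64.3, 29.6     # planning success rate, %

print(f"time reduction: {relative_reduction(time_next, time_ours):.1f}%")
print(f"grasp success gap: {point_gap(grasp_ours, grasp_next):.1f} points")
print(f"planning success gap: {point_gap(plan_ours, plan_next):.1f} points")
```

By this calculation the reported method is roughly 17% faster than the second-fastest system, with gaps of 20 and about 35 percentage points in grasp and planning success, respectively.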

"In general, our definition of human grasps covers 77% of the user grasps even before they know the ways of grasps defined in our system," wrote the researchers. "While our system can deal [with] most of the unseen human grasps, they tend to lead to higher uncertainty and sometimes would cause the robot to back off and replan. ... This suggests directions for future research; ideally we would be able to handle a wider range of grasps that a human might want to use."

In the future, they plan to adapt the system to learn grasp poses for different grasp types from data instead of manually specified rules.