Stanford researchers propose AI that figures out how to use real-world objects

One longstanding goal of AI research is to allow robots to meaningfully interact with real-world environments. In a recent paper, researchers at Stanford and Facebook took a step toward this by extracting information related to actions like pushing or pulling objects with movable parts and using it to train an AI model. For example, given a drawer, their model can predict that applying a pulling force on the handle would open the drawer.

As the researchers note, humans interact with a plethora of objects around them. What makes this possible is our understanding of what can be done with each object, where this interaction may occur, and how we must move our bodies to accomplish it. Not only do people understand what actions will be successful, but they intuitively know which ones will not.

The coauthors considered long-term interactions with objects as sequences of short-term "atomic" interactions, like pushing and pulling. This limited the scope of their work to plausible short-term interactions a robot could perform given the current state of an object. These interactions were further decomposed into "where" and "how" -- for example, which handle on a cabinet a robot should pull and whether a robot should pull parallel or perpendicular to the handle.

These observations allowed the researchers to formulate their task as one of dense visual prediction. They developed a model that, given a depth or color image of an object, learned to infer whether a certain action could be performed and how it should be executed. For each pixel, the model provided an "actionability" score, action proposals, and success likelihoods.
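To make those per-pixel outputs concrete, here is a minimal Python sketch of how an agent might consume such predictions. It is not the researchers' code: the `PixelPrediction` structure, its field names, the `choose_interaction` helper, the 0.5 cutoff, and all of the numbers are hypothetical and chosen only for illustration. The sketch picks the pixel with the highest actionability score, then the proposal at that pixel with the highest predicted success likelihood.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


# Hypothetical container for one pixel's predictions. The field names are
# assumptions made for this sketch, not identifiers from the paper.
@dataclass
class PixelPrediction:
    row: int
    col: int
    actionability: float                 # how promising this pixel is for the action
    proposals: List[Tuple[str, float]]   # (proposal label, predicted success likelihood)


def choose_interaction(
    preds: List[PixelPrediction], min_actionability: float = 0.5
) -> Optional[Tuple[int, int, str, float]]:
    """Return (row, col, proposal, likelihood) for the best pixel, or None.

    The 0.5 cutoff is an arbitrary illustrative value.
    """
    candidates = [
        p for p in preds if p.actionability >= min_actionability and p.proposals
    ]
    if not candidates:
        return None
    # "Where": the pixel with the highest actionability score.
    best = max(candidates, key=lambda p: p.actionability)
    # "How": the proposal at that pixel with the highest success likelihood.
    proposal, likelihood = max(best.proposals, key=lambda pr: pr[1])
    return best.row, best.col, proposal, likelihood


# Made-up predictions for a "pull" action on an image of a drawer.
predictions = [
    PixelPrediction(10, 12, 0.91, [("pull outward", 0.88), ("pull sideways", 0.20)]),
    PixelPrediction(40, 40, 0.15, [("pull outward", 0.05)]),
]
print(choose_interaction(predictions))
```

In this toy input the first pixel stands in for a point on the drawer handle and the second for a point on a flat panel, so the helper selects the handle pixel and the outward pull.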

"Our approach allows an agent to learn these by simply interacting with various objects, and recording the outcomes of its actions -- labeling ones that cause a desirable state change as successful," the coauthors wrote. "We empirically show that our method successfully learns to predict possible actions for novel objects, and does so even for previously unseen categories."
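The learn-by-interaction idea in that quote can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the researchers' pipeline: `toy_drawer` is a hypothetical stand-in for a physics simulator, the handle region and the `min_change` threshold are invented, and "desirable state change" is simplified here to "the part moved by at least a small amount."

```python
import random


def label_outcome(before: float, after: float, min_change: float = 0.01) -> bool:
    """Label a trial successful if the part's state changed enough.

    The threshold is an assumption made for this sketch.
    """
    return abs(after - before) >= min_change


def toy_drawer(pixel, action):
    """Hypothetical stand-in for a simulator: returns (state_before, state_after).

    The drawer opens only when the agent pulls on the (invented) handle region.
    """
    row, col = pixel
    on_handle = 30 <= row <= 33 and 20 <= col <= 43
    opened = 0.1 if (action == "pull" and on_handle) else 0.0
    return 0.0, opened


def collect_interactions(simulate, num_trials, image_size=64, seed=0):
    """Try random pixel/action pairs and record each outcome with its label."""
    rng = random.Random(seed)
    records = []
    for _ in range(num_trials):
        pixel = (rng.randrange(image_size), rng.randrange(image_size))
        action = rng.choice(["push", "pull"])
        before, after = simulate(pixel, action)
        records.append(
            {"pixel": pixel, "action": action, "success": label_outcome(before, after)}
        )
    return records


records = collect_interactions(toy_drawer, num_trials=1000)
successes = [r for r in records if r["success"]]
print(len(records), "trials,", len(successes), "labeled successful")
```

Records like these, pairing a pixel and an action with a success label, are the kind of training signal the quote describes. Most random trials fail, and those failures serve as negative examples.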

The researchers used a simulator called SAPIEN to train and test their approach on six types of interactions, covering 972 shapes across 15 commonly seen indoor object categories. In experiments, they visualized the model's action scoring predictions over real-world 3D scans from open source datasets. They concede that predictions for pixels outside the articulated parts come with no guarantee, but say those results are still sensible if the entire object is allowed to move.

"Our [model] learns to extract geometric features that are action-specific and gripper-aware. For example, for pulling, it predicted higher scores over high-curvature regions such as part boundaries and handles, while for pushing, almost all flat surface pixels belonging to a pushable part are equally highlighted and the pixels around handles are reasonably predicted to be not pushable due to object-gripper collisions ... While we use simulated environments for learning as they allow efficient interaction, we also find that our learned system generalizes to real-world scans and images."

The researchers admit that their work has limitations. For one, the model can only take a single frame as input, which introduces ambiguities if the articulated part is in motion. It's also limited to hard-coded motion trajectories. In future work, however, the coauthors plan to generalize the model to freeform interactions.
