There’s much robots can achieve by observing human demonstrations, like the actions necessary to move a box of crackers from a counter to storage. But imitation learning is by no means a perfect science — demonstrators often complete subgoals that distract systems from overarching tasks.

To address this, researchers at the University of Washington, Stanford University, the University of Illinois Urbana-Champaign, the University of Toronto, and Nvidia propose an “inverse planning” system that taps motions, or low-level trajectories, to capture the intention behind actions. After collecting a corpus of video demonstrations conditioned on a set of kitchen goals and evaluating their technique against it, the team reports that their motion reasoning approach improves task success by over 20%.

The researchers lay out the full extent of the problem in a preprint paper detailing their work. In an environment like a cluttered kitchen, they note, objects can be arranged in such a way that the goal is obscured. Recognizing an action sequence isn’t enough, because the same action could have several motivations. For example, a demonstrator might move a tablecloth either to get it out of view or to reach a knife underneath it.

The researchers’ AI system, then, outputs the symbolic goal of a task given a real-world video demonstration, and that goal can be used as input for robotics systems to reproduce the task. To test it, they had it learn a 24-task cooking objective in which a human cook poured and prepped ingredients (tomato soup and Spam) that were initially blocked by three objects: a cracker box, a mustard bottle, and a sugar box. They recorded four demonstrations for each task, for a total of 96 demonstrations (excluding videos with substantial missing poses), and then split the tasks in two: 12 for system training and 12 for testing.
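
The arithmetic behind that split can be sketched in a few lines of Python. This is only an illustration: it assumes every task keeps all four demonstrations and that tasks are assigned to the two halves at random, neither of which the article confirms, and the names `split_by_task`, `tasks`, and `demos` are hypothetical.

```python
import random


def split_by_task(task_ids, n_train, seed=0):
    """Shuffle task ids and split them so no task lands in both halves."""
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    shuffled = list(task_ids)
    rng.shuffle(shuffled)
    return set(shuffled[:n_train]), set(shuffled[n_train:])


# 24 tasks with 4 recorded demonstrations each (assumed: none dropped).
tasks = range(24)
demos = [(task, take) for task in tasks for take in range(4)]

# Random assignment is an assumption; the article only gives the 12/12 sizes.
train_tasks, test_tasks = split_by_task(tasks, n_train=12)
train_demos = [d for d in demos if d[0] in train_tasks]
test_demos = [d for d in demos if d[0] in test_tasks]

print(len(demos), len(train_tasks), len(test_tasks))  # 96 12 12
print(len(train_demos), len(test_demos))              # 48 48
```

Splitting by task rather than by individual video means every demonstration of a test task is unseen during training.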

The researchers say that their full model explicitly performed motion reasoning about the objects in the demonstration, and thus didn’t blindly treat every object movement as intentional. Additionally, they note that it enabled imitation learning across different environments. In one experiment, the system extracted the correct goal even though the demonstrator had manipulated an object (the aforementioned sugar box) along the way. Although the sugar box also appeared in the robot’s kitchen, the robot recognized that the box didn’t need to be moved because it was already out of the way.
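
To make the idea concrete, here is a toy Python sketch of that kind of motion reasoning. It is not the researchers’ method: it assumes a flat 2D layout, a straight-line reach from the hand to the target, and a fixed clearance radius, and every name in it (`blocking_objects`, `infer_goal_objects`, the coordinates) is hypothetical.

```python
from math import hypot


def distance_to_segment(p, a, b):
    """Shortest distance from point p to the segment a-b (2D tuples)."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    length_sq = dx * dx + dy * dy
    if length_sq == 0:
        return hypot(px - ax, py - ay)
    # Project p onto the segment and clamp to its endpoints.
    t = max(0.0, min(1.0, ((px - ax) * dx + (py - ay) * dy) / length_sq))
    return hypot(px - (ax + t * dx), py - (ay + t * dy))


def blocking_objects(layout, hand, target, clearance=0.1):
    """Objects lying within `clearance` of the straight reach to `target`."""
    return {
        name
        for name, pos in layout.items()
        if name != target
        and distance_to_segment(pos, hand, layout[target]) < clearance
    }


def infer_goal_objects(layout, hand, moved):
    """Split moved objects into intended goals and incidental obstacle moves.

    Simplification: uses the initial layout only, without replaying moves.
    """
    goal, incidental = [], []
    for i, name in enumerate(moved):
        # A move is incidental if it cleared the reach to a later object.
        cleared_a_path = any(
            name in blocking_objects(layout, hand, later)
            for later in moved[i + 1:]
        )
        (incidental if cleared_a_path else goal).append(name)
    return goal, incidental


hand = (0.0, 0.0)
demo_kitchen = {
    "tomato_soup": (1.0, 0.0),
    "sugar_box": (0.5, 0.02),  # sits on the reach to the soup
    "mustard_bottle": (0.5, 0.8),
}
real_kitchen = {
    "tomato_soup": (1.0, 0.0),
    "sugar_box": (0.5, 0.6),  # already out of the way
    "mustard_bottle": (0.2, 0.9),
}

goal, incidental = infer_goal_objects(
    demo_kitchen, hand, ["sugar_box", "tomato_soup"]
)
print(goal, incidental)  # ['tomato_soup'] ['sugar_box']
print(blocking_objects(real_kitchen, hand, "tomato_soup"))  # set()
```

In the demonstration layout the sugar box is explained away as an obstacle, so only the soup ends up in the goal. In the second layout nothing blocks the soup, so a planner working from that goal would have no reason to touch the sugar box.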

“Our results show that this allows us to significantly outperform previous approaches that aim to infer the goal based on either just motion planning or task planning,” wrote the coauthors. “In addition, we show that our goal-based formulation enables the robot to reproduce the same goal in a real kitchen by just watching the video demonstration from a mockup kitchen.”