DeepMind trains robots to insert USB keys and stack colored blocks

Robots perform better at a range of tasks when they draw on a growing body of experience. That's the assertion of a team of researchers from DeepMind, who in a preprint paper propose a technique called reward sketching. They claim it's an effective way of eliciting human preferences to learn a reward function -- a function describing how an AI agent should behave -- that can then be used to retrospectively annotate all historical data, including data collected for different tasks, with predicted rewards for the new task. The annotated data set can in turn be used to learn manipulation policies (probability distributions over actions given certain states) with reinforcement learning from visual input and without interaction with a real robot, the team says.

The work builds on a DeepMind study published in January 2020, which described a technique -- continuous-discrete hybrid learning -- that optimizes for discrete and continuous actions simultaneously, treating hybrid problems in their native form. As something of a precursor to that paper, in October 2019, the Alphabet subsidiary demonstrated a novel way of transferring skills from simulation to a physical robot.

"[Our] approach makes it possible to scale up RL in robotics, as we no longer need to run the robot for each step of learning. We show that the trained batch [reinforcement learning] agents, when deployed in real robots, can perform a variety of challenging tasks involving multiple interactions among rigid or deformable objects," wrote the coauthors of this latest paper. "Moreover, they display a significant degree of robustness and generalization. In some cases, they even outperform human teleoperators."

As the team explains, at the heart of reward sketching are three key ideas: efficient elicitation of user preferences to learn reward functions, automatic annotation of all historical data with learned reward functions, and harnessing the data sets to learn policies from stored data via reinforcement learning.

For instance, a human teleoperates a robot with a six-degree-of-freedom mouse and a gripper button or a handheld virtual reality controller to provide first-person demonstrations of a target task. To specify a new target task, the operator controls the robot to provide several successful (and optionally unsuccessful) examples of completing the task, and these demonstrations help to bootstrap the reward learning by providing examples of successful behavior with high rewards.

In the researchers' proposed approach, all robot experience -- including demonstrations, teleoperated trajectories, human play data, and experience from the execution of either scripted or learned policies -- is accumulated into what's called NeverEnding Storage (NES). A metadata system implemented as a relational database ensures the data can be appropriately annotated and queried; it attaches environment and policy metadata to every trajectory, as well as arbitrary human-readable labels and reward sketches.
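To make the idea of a relational metadata store concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table names, column names, and helper functions are hypothetical and chosen for illustration; the article does not describe the actual NES schema.

```python
import sqlite3


def build_store():
    """Create an in-memory database with a hypothetical episode/label schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE episodes (
            id INTEGER PRIMARY KEY,
            environment TEXT NOT NULL,  -- e.g. which robot cage produced the data
            policy TEXT NOT NULL        -- e.g. teleop, scripted, learned
        );
        CREATE TABLE labels (
            episode_id INTEGER NOT NULL REFERENCES episodes(id),
            label TEXT NOT NULL         -- arbitrary human-readable tag
        );
        """
    )
    return conn


def add_episode(conn, environment, policy, labels=()):
    """Insert one episode with its metadata and return the new episode id."""
    cur = conn.execute(
        "INSERT INTO episodes (environment, policy) VALUES (?, ?)",
        (environment, policy),
    )
    episode_id = cur.lastrowid
    conn.executemany(
        "INSERT INTO labels (episode_id, label) VALUES (?, ?)",
        [(episode_id, label) for label in labels],
    )
    return episode_id


def episodes_with_label(conn, label):
    """Return the ids of all episodes carrying the given label, in id order."""
    rows = conn.execute(
        "SELECT e.id FROM episodes e "
        "JOIN labels l ON l.episode_id = e.id "
        "WHERE l.label = ? ORDER BY e.id",
        (label,),
    )
    return [row[0] for row in rows]
```

A query such as episodes_with_label(conn, "stack") would then pull every stored trajectory tagged for a stacking task, regardless of which policy generated it.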

In the reward-sketching phase, humans annotate a subset of episodes from NES (including task-specific demos) with reward values, using a technique that allows a single person to produce hundreds of annotations per minute. These annotations feed into a reward model that's then used to predict reward values for all experience in NES, so that all historical data can be leveraged in training a policy for a new task without requiring manual annotation of the whole repository.
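The annotate-a-subset, predict-for-everything pattern can be illustrated with a deliberately tiny stand-in. The sketch below fits an ordinary least-squares line from a single hypothetical per-frame feature to the human-sketched reward, then applies it to every stored episode. This is not the researchers' reward model, which learns from camera images; the function names and the scalar feature are assumptions made purely for illustration.

```python
def fit_linear_reward(features, rewards):
    """Fit reward = w * feature + b by least squares on annotated frames."""
    n = len(features)
    mean_x = sum(features) / n
    mean_y = sum(rewards) / n
    var_x = sum((x - mean_x) ** 2 for x in features)
    if var_x == 0:
        raise ValueError("features must not all be identical")
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(features, rewards))
    w = cov / var_x
    b = mean_y - w * mean_x
    return w, b


def annotate_all(episodes, w, b):
    """Predict a reward in [0, 1] for every frame of every stored episode.

    `episodes` maps an episode id to its list of per-frame feature values.
    """
    return {
        episode_id: [min(1.0, max(0.0, w * x + b)) for x in frames]
        for episode_id, frames in episodes.items()
    }
```

The point of the pattern is that only the frames passed to fit_linear_reward need a human in the loop; annotate_all then labels the rest of the repository automatically.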

An agent is trained with 75% of the batch drawn from the entirety of NES and 25% from the data specific to the target task. Then, it's deployed to a robot, which enables the collection of more experience to be used for reward sketching or reinforcement learning.
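The 75/25 mix described above can be sketched in a few lines. The function name, the sampling with replacement, and the rounding rule for the split are assumptions; the article gives only the proportions.

```python
import random


def sample_mixed_batch(all_episodes, task_episodes, batch_size,
                       all_fraction=0.75, rng=None):
    """Draw a training batch mixing general and task-specific experience.

    all_fraction of the batch comes from the whole repository and the
    remainder from the target task's data, both sampled with replacement.
    """
    rng = rng or random.Random()
    n_all = round(batch_size * all_fraction)
    n_task = batch_size - n_all
    batch = rng.choices(all_episodes, k=n_all) + rng.choices(task_episodes, k=n_task)
    rng.shuffle(batch)  # avoid a fixed ordering of the two sources
    return batch
```

Oversampling the task-specific data this way keeps it from being drowned out by the much larger pool of unrelated experience.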

In experiments, the DeepMind team used a Sawyer robot with a gripper and a wrist force-torque sensor. Observations were provided by three cameras around a cage, as well as two wide-angle cameras and one depth camera mounted at the wrist and proprioceptive sensors in the arm. In total, the team collected over 400 hours of multiple-camera videos, proprioception (i.e., perception or awareness of position and movement), and actions from behavior generated by human teleoperators, as well as random, scripted, and learned policies.

The researchers trained multiple reinforcement learning agents in parallel for 400,000 steps and evaluated the most promising on the real-world robot. Tasked with lifting and stacking rectangular objects, the Sawyer successfully lifted 80% of the time and stacked 60% of the time, and 80% and 40% of the time, respectively, when those objects were positioned in "adversarial" ways. Perhaps more impressively, in a separate task involving the precise insertion of a USB key into a computer port, the agent -- when provided reward sketches from over 100 demonstrators -- reached a success rate of over 80% within 8 hours.

"The multi-component system allows a robot to solve a variety of challenging tasks that require skillful manipulation, involve multi-object interaction, and consist of many time steps," wrote the researchers. "There is no need to worry about wear and tear, limits of real time processing, and many of the other challenges associated with operating real robots. Moreover, researchers are empowered to train policies using their batch [reinforcement learning] algorithm of choice."

They leave to future work identifying ways to minimize human-in-the-loop training, and to minimize the agents' sensitivity to "significant perturbations" in the setup.