The folks at Google have devised AI capable of predicting which machine learning models will produce the best results. In a newly published paper (“Off-Policy Evaluation via Off-Policy Classification”) and blog post, a team of Google AI researchers proposes what they call “off-policy classification,” or OPC, which evaluates the performance of AI-driven agents by treating evaluation as a classification problem.

The team notes that their approach, which builds on reinforcement learning (a technique that employs rewards to drive software policies toward goals), works with image inputs and scales to tasks such as vision-based robotic grasping. “Fully off-policy reinforcement learning is a variant in which an agent learns entirely from older data, which is appealing because it enables model iteration without requiring a physical robot,” writes Robotics at Google software engineer Alex Irpan. “With fully off-policy RL, one can train several models on the same fixed dataset collected by previous agents, then select the best one.”

Arriving at OPC was a bit more challenging than it sounds. As Irpan and fellow coauthors note, off-policy reinforcement learning allows AI models for, say, a robot to be trained without the physical hardware, but it doesn't allow them to be evaluated that way. Furthermore, they point out that ground-truth evaluation is generally too inefficient for methods that require evaluating a large number of models.

Their solution, OPC, addresses this by assuming that the tasks at hand involve little to no randomness in how states change and that agents either succeed or fail at the end of experimental trials. The binary nature of the second assumption allows one of two classification labels (“effective” for success or “catastrophic” for failure) to be assigned to each action.
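To make the labeling idea concrete, here is a minimal Python sketch, not the researchers' code. The function name `label_episode` and the (state, action) tuple format are hypothetical. It labels every action in a successful trial as “effective.” For a failed trial, the outcome alone does not reveal which action was the catastrophic one, so the sketch leaves those actions unlabeled; that handling is an assumption of this example rather than something stated above.

```python
from typing import List, Optional, Tuple

# Hypothetical placeholder types: a transition is a (state, action) pair.
Transition = Tuple[str, str]
Labeled = Tuple[str, str, Optional[str]]


def label_episode(transitions: List[Transition], succeeded: bool) -> List[Labeled]:
    """Attach a label to each action based only on the trial's final outcome.

    With near-deterministic dynamics, every action in a successful trial
    kept success reachable, so it is marked "effective". In a failed trial
    we cannot tell which action caused the failure, so the label is None
    (an assumption of this sketch).
    """
    label = "effective" if succeeded else None
    return [(state, action, label) for state, action in transitions]


success_trial = label_episode([("s0", "reach"), ("s1", "close_gripper")], True)
failed_trial = label_episode([("s0", "reach"), ("s1", "open_gripper")], False)
```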

Above: On the left is a baseline. On the right is one of the proposed methods, the SoftOPC.

Image Credit: Google

OPC additionally relies on what’s called a Q-function (learned with a Q-learning algorithm) to estimate actions’ future total rewards. Agents choose actions with the largest projected rewards, and their performance is measured by how often the selected actions are effective (which depends on how accurately the Q-function classifies actions as effective versus catastrophic). The classification accuracy acts as an off-policy evaluation score.
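As a rough illustration of treating evaluation as classification, the Python sketch below scores a Q-function by how often its thresholded outputs agree with effective/catastrophic labels. It is a simplification under stated assumptions, not the paper's estimator: it assumes every action already has a known label, that Q-values lie between 0 and 1, and that a cutoff of 0.5 separates the two classes. The function name `classification_score` and the sample numbers are made up for the example.

```python
import numpy as np


def classification_score(q_values, is_effective, threshold=0.5):
    """Fraction of actions the Q-function classifies correctly.

    q_values:     predicted future reward for each (state, action) pair
    is_effective: True if the action is effective, False if catastrophic
    threshold:    hypothetical cutoff above which an action is predicted effective
    """
    q = np.asarray(q_values, dtype=float)
    labels = np.asarray(is_effective, dtype=bool)
    predicted_effective = q > threshold
    return float(np.mean(predicted_effective == labels))


# Invented values: the third action is rated highly but is actually catastrophic.
q_values = [0.9, 0.2, 0.7, 0.4]
is_effective = [True, False, False, False]
score = classification_score(q_values, is_effective)  # 3 of 4 correct
```

A higher score means the Q-function separates good actions from bad ones more reliably on the held-out data, which is the property the evaluation score is meant to capture.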

The team trained machine learning policies in simulation using fully off-policy reinforcement learning and then evaluated them using the off-policy scores tabulated from previous real-world data. In a robot grasping task, they report that one variant of OPC in particular — SoftOPC — performed best at predicting final success rates. Given 15 models of varying robustness (seven of which were trained purely in simulation), SoftOPC generated scores closely correlated with true grasp success and “significantly” more reliable than baseline methods.
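One simple way to check whether an off-policy score tracks real-world performance is to compare how it ranks a set of models against how their measured success rates rank them. The Python sketch below does this with a Spearman rank correlation; the choice of that statistic, the function name, and all of the numbers are assumptions for illustration and are not results from the paper.

```python
import numpy as np


def spearman_rank_correlation(x, y):
    """Spearman correlation for inputs without tied values.

    Each input is converted to ranks (0 = smallest), then the Pearson
    correlation of the two rank vectors is returned.
    """
    rank_x = np.argsort(np.argsort(x))
    rank_y = np.argsort(np.argsort(y))
    return float(np.corrcoef(rank_x, rank_y)[0, 1])


# Invented numbers for five hypothetical models.
offline_scores = [0.10, 0.35, 0.20, 0.60, 0.50]     # off-policy evaluation scores
true_success = [0.15, 0.40, 0.30, 0.85, 0.70]       # measured grasp success rates

correlation = spearman_rank_correlation(offline_scores, true_success)
best_model = int(np.argmax(offline_scores))  # index of the model the score would pick
```

When the correlation is close to 1, choosing the model with the highest off-policy score is likely to pick one of the best performers without running every candidate on the robot.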

In future work, the researchers intend to explore tasks with “noisier” and nonbinary dynamics. “[W]e think the results are promising enough to be applied to many real-world RL problems,” writes Irpan.