Deep reinforcement learning — an AI training technique that employs rewards to drive software policies toward goals — has been tapped to model the impact of social norms, create AI that’s exceptionally good at playing games, and program robots that can recover from nasty spills. But despite its versatility, reinforcement learning (or “RL,” as it’s typically abbreviated) has a showstopping shortcoming: It’s inefficient. Training a policy requires lots of interactions within a simulated or real-world environment — far more than the average person needs to learn a task.

To address that inefficiency in video games, researchers at Google recently proposed a new algorithm, Simulated Policy Learning (SimPLe for short), which uses game models to learn quality policies for selecting actions. They describe it in a newly published preprint paper (“Model-Based Reinforcement Learning for Atari”) and in documentation accompanying the open-sourced code.

“At a high-level, the idea behind SimPLe is to alternate between learning a world model of how the game behaves and using that model to optimize a policy (with model-free reinforcement learning) within the simulated game environment,” wrote Google AI scientists Łukasz Kaiser and Dumitru Erhan. “The basic principles behind this algorithm are well established and have been employed in numerous recent model-based reinforcement learning methods.”
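To make the alternation concrete, here is a minimal sketch on a toy problem. It is not Google's implementation: the five-cell "chain" game, the lookup-table world model, the tabular Q-learning policy, and every function name and parameter below are assumptions chosen for illustration. What it shares with the description above is the shape of the loop, in which a small batch of real interactions updates the model and the policy is then improved using only transitions the model supplies.

```python
import random

N_STATES = 5        # toy chain of cells 0..4 (hypothetical stand-in for a game)
ACTIONS = (0, 1)    # 0 = move left, 1 = move right


def real_step(state, action):
    """The 'real' environment: reward 1.0 for reaching the rightmost cell."""
    move = 1 if action == 1 else -1
    nxt = max(0, min(N_STATES - 1, state + move))
    return nxt, (1.0 if nxt == N_STATES - 1 else 0.0)


def greedy(q, state):
    """Pick the action with the highest estimated value."""
    return max(ACTIONS, key=lambda a: q[(state, a)])


def collect_real_data(q, rng, n_steps, epsilon=0.3):
    """Spend n_steps real interactions, exploring at random part of the time."""
    data, state = [], 0
    for _ in range(n_steps):
        best = max(q[(state, a)] for a in ACTIONS)
        ties = [a for a in ACTIONS if q[(state, a)] == best]
        action = rng.choice(ACTIONS) if rng.random() < epsilon else rng.choice(ties)
        nxt, reward = real_step(state, action)
        data.append((state, action, nxt, reward))
        state = 0 if nxt == N_STATES - 1 else nxt   # restart after the goal
    return data


def fit_world_model(model, data):
    """'Learn' the world model: here, just a table of observed transitions."""
    for state, action, nxt, reward in data:
        model[(state, action)] = (nxt, reward)


def improve_policy_in_model(q, model, n_sweeps, lr=0.5, gamma=0.9):
    """Q-learning updates that query only the learned model, never real_step."""
    for _ in range(n_sweeps):
        for state, action in sorted(model):
            nxt, reward = model[(state, action)]
            target = reward + gamma * max(q[(nxt, a)] for a in ACTIONS)
            q[(state, action)] += lr * (target - q[(state, action)])


def simple_style_loop(n_iterations=5, steps_per_iteration=20, seed=0):
    """Alternate between updating the world model and improving the policy."""
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
    model, real_interactions = {}, 0
    for _ in range(n_iterations):
        data = collect_real_data(q, rng, steps_per_iteration)
        real_interactions += len(data)
        fit_world_model(model, data)
        improve_policy_in_model(q, model, n_sweeps=50)
    return q, model, real_interactions
```

With the default arguments the loop spends 100 real interactions in total (five iterations of 20 steps), while the policy receives 50 update sweeps per iteration drawn entirely from the table. That asymmetry is the point of the model-based approach; a real system would replace the table with a learned video-prediction network and the Q-table with a neural policy.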

As the two researchers further explain, training an AI system to play games requires predicting the target game’s next frame given a sequence of observed frames and commands (e.g., “left,” “right,” “forward,” “backward”). A successful model, they point out, can produce trajectories that could be used to train a gaming agent policy, which would obviate the need to rely on computationally costly in-game sequences.


Above: The SimPLe model applied to Kung Fu Master.

Image Credit: Google AI

SimPLe does exactly this. It takes four frames as input to predict the next frame along with the reward, and after it’s fully trained, it produces “rollouts” — sample sequences of actions, observations, and outcomes — that are used to improve policies. (Kaiser and Erhan note that SimPLe only uses medium-length rollouts to minimize prediction errors.)
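The mechanics of a rollout can be sketched in a few lines. The code below is an illustration under stated assumptions, not the published model: `toy_world_model` is a hypothetical stand-in in which each "frame" is a single number (a ball's position), and the `rollout` helper and its `horizon` parameter are names invented here. Only the four-frame context and the idea of a fixed-length simulated trajectory come from the description above.

```python
from collections import deque

CONTEXT = 4  # the model conditions on the four most recent frames


def toy_world_model(frames, action):
    """Hypothetical trained model: predicts the next frame and its reward.

    Each 'frame' is a 1-D ball position. Velocity is inferred from the last
    two frames, which is why a single frame would not be enough input.
    """
    assert len(frames) == CONTEXT
    velocity = frames[-1] - frames[-2]
    next_frame = frames[-1] + velocity + action
    reward = 1.0 if next_frame >= 10 else 0.0
    return next_frame, reward


def rollout(model, policy, start_frames, horizon):
    """Simulate `horizon` steps, returning (action, frame, reward) triples."""
    context = deque(start_frames, maxlen=CONTEXT)   # sliding window of frames
    trajectory = []
    for _ in range(horizon):
        action = policy(tuple(context))
        next_frame, reward = model(tuple(context), action)
        trajectory.append((action, next_frame, reward))
        context.append(next_frame)   # the oldest frame drops out automatically
    return trajectory


def always_coast(frames):
    """A trivial policy for demonstration: never push the ball."""
    return 0


if __name__ == "__main__":
    for step in rollout(toy_world_model, always_coast, (0, 1, 2, 3), horizon=3):
        print(step)
```

Because every predicted frame is fed back in as input for the next prediction, small errors compound as the rollout grows, which is one reason to cap `horizon` rather than simulate a whole game.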

In experiments lasting the equivalent of two hours of gameplay (100,000 interactions), agents with SimPLe-tuned policies managed to achieve the maximum score in two test games (Pong and Freeway) and generate “near-perfect predictions” up to 50 steps into the future. The agents occasionally struggled to capture “small but highly relevant” objects in games, resulting in failure cases, and Kaiser and Erhan concede that SimPLe doesn’t yet match the performance of standard RL methods. But SimPLe was up to two times more efficient in terms of training, and the research team expects future work will improve its performance measurably.

“The main promise of model-based reinforcement learning methods is in environments where interactions are either costly, slow or require human labeling, such as many robotics tasks,” they wrote. “In such environments, a learned simulator would enable a better understanding of the agent’s environment and could lead to new, better and faster ways for doing multi-task reinforcement learning.”