Reinforcement learning, a machine learning training technique that uses rewards to drive AI agents toward certain goals, is a reliable means of improving those agents’ decision-making, given plenty of compute, data, and time. But it’s not always practical: model-free approaches, which aim to get agents to directly predict actions from observations about their world, can take weeks of training.

Model-based reinforcement learning is a viable alternative: it has agents come up with a general model of their environment that they can use to plan ahead. But to plan accurately in unfamiliar surroundings, those agents have to formulate rules from experience. Toward that end, Google in collaboration with DeepMind today introduced the Deep Planning Network (PlaNet) agent, which learns a world model from image inputs and leverages it for planning. It’s able to solve a variety of image-based tasks with up to 5,000 percent (that is, 50 times) the data efficiency of model-free methods, Google says, while remaining competitive with advanced model-free agents.

The source code is available on GitHub.

As Danijar Hafner, a coauthor of the academic paper describing PlaNet’s architecture and a student researcher at Google AI, explains, PlaNet works by learning dynamics models from image inputs and planning with those models to gather new experience. It specifically leverages a latent dynamics model (a model that predicts the latent state forward in time and produces an image and a reward at each step from the corresponding latent state) to gain an understanding of abstract representations such as the velocities of objects. The PlaNet agent learns through this predictive image generation, and it plans quickly: in the compact latent state space, it only needs to project future rewards, not images, to evaluate an action sequence.
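To make the "rewards, not images" point concrete, here is a minimal Python sketch. It is not PlaNet's code: the class `ToyLatentModel`, its hand-written point-mass dynamics, and the function `imagined_return` are all hypothetical stand-ins for components that the real agent learns from images. The sketch only illustrates that scoring an action sequence can be done with the transition and reward parts of a latent model, without ever calling the image decoder.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple

LatentState = Tuple[float, float]  # hypothetical latent: (position, velocity)


@dataclass
class ToyLatentModel:
    """Hand-written stand-in for a learned latent dynamics model (assumption)."""

    dt: float = 0.1        # hypothetical time step
    goal: float = 1.0      # hypothetical goal position
    decode_calls: int = 0  # counts how often an image is generated

    def transition(self, state: LatentState, action: float) -> LatentState:
        """Predict the next latent state from the current one and an action."""
        pos, vel = state
        vel = vel + self.dt * action
        pos = pos + self.dt * vel
        return (pos, vel)

    def reward(self, state: LatentState) -> float:
        """Predict the reward directly from the latent state."""
        pos, _ = state
        return -abs(self.goal - pos)

    def decode_image(self, state: LatentState, width: int = 11) -> List[float]:
        """Render a tiny 1-D 'image'; used for learning, not for planning."""
        self.decode_calls += 1
        pos, _ = state
        idx = min(width - 1, max(0, round(pos * (width - 1))))
        return [1.0 if i == idx else 0.0 for i in range(width)]


def imagined_return(model: ToyLatentModel, state: LatentState,
                    actions: Sequence[float]) -> float:
    """Sum predicted rewards along a latent rollout without decoding images."""
    total = 0.0
    for action in actions:
        state = model.transition(state, action)
        total += model.reward(state)
    return total


if __name__ == "__main__":
    model = ToyLatentModel()
    print(imagined_return(model, (0.0, 0.0), [1.0] * 10))  # pushing toward the goal
    print(imagined_return(model, (0.0, 0.0), [0.0] * 10))  # standing still
    print(model.decode_calls)  # the decoder was never needed for evaluation
```

In this toy setup the rollout touches only two small functions per step, which is the reason evaluating many candidate action sequences in a compact latent space can be cheap.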

In contrast to previous approaches, PlaNet effectively works without a policy network — instead, it chooses actions based on planning. “For example,” Hafner said, “the agent can imagine how the position of a ball and its distance to the goal will change for certain actions, without having to visualize the scenario. This allows us to compare 10,000 imagined action sequences with a large batch size every time the agent chooses an action. We then execute the first action of the best sequence found and replan at the next step.”
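The plan, act, and replan loop Hafner describes can be sketched in a few lines of Python. This is an illustration under stated assumptions, not PlaNet's implementation: the one-dimensional ball model (`imagine_step`, `imagined_reward`), the constants `GOAL` and `DT`, and the plain random search in `plan` are all hypothetical, and the search shown here is a simple stand-in rather than the procedure PlaNet uses. The candidate count is also far smaller than the 10,000 sequences mentioned in the quote, to keep the example fast.

```python
import random

GOAL = 1.0  # hypothetical goal position for the ball
DT = 0.1    # hypothetical time step


def imagine_step(position: float, action: float) -> float:
    """Toy stand-in for a learned model: the action acts as a velocity command."""
    return position + DT * action


def imagined_reward(position: float) -> float:
    """Reward is less negative the closer the ball is to the goal."""
    return -abs(GOAL - position)


def plan(position: float, rng: random.Random,
         horizon: int = 10, candidates: int = 500) -> float:
    """Score random action sequences in imagination; return the best first action."""
    best_return, best_first_action = float("-inf"), 0.0
    for _ in range(candidates):
        actions = [rng.uniform(-1.0, 1.0) for _ in range(horizon)]
        imagined, total = position, 0.0
        for action in actions:
            imagined = imagine_step(imagined, action)
            total += imagined_reward(imagined)
        if total > best_return:
            best_return, best_first_action = total, actions[0]
    return best_first_action


def run_episode(steps: int = 40, seed: int = 0) -> float:
    """Execute only the first planned action each step, then replan."""
    rng = random.Random(seed)
    position = 0.0
    for _ in range(steps):
        action = plan(position, rng)               # replan from the current state
        position = imagine_step(position, action)  # toy environment reuses the model
    return position


if __name__ == "__main__":
    final_position = run_episode()
    print(f"distance to goal after episode: {abs(GOAL - final_position):.3f}")
```

Note that no policy network appears anywhere: the action at each step comes entirely from searching over imagined futures, and everything after the first action of the winning sequence is discarded and recomputed at the next step.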

Above: PlaNet learning tasks in a simulated environment.

Image Credit: Google

Google says that in tests on six continuous control tasks (including one in which a simulated robot lying on the ground had to learn to stand up and walk, and one that called for a model able to predict multiple possible futures), PlaNet outperformed or came close to outperforming model-free methods like A3C and D4PG on image-based tasks. Moreover, when PlaNet was placed randomly into different environments without knowing the task, it managed to learn all six tasks without modification in as few as 2,000 attempts. (Previous agents that don’t learn a model of the environment sometimes require 50 times as many attempts to reach comparable performance.)

Hafner and coauthors believe that scaling up the processing power could produce an even more robust model.

“Our results showcase the promise of learning dynamics models for building autonomous reinforcement learning agents,” he wrote. “We advocate for further research that focuses on learning accurate dynamics models on tasks of even higher difficulty, such as 3D environments and real-world robotics tasks … We are excited about the possibilities that model-based reinforcement learning opens up, including multi-task learning, hierarchical planning and active exploration using uncertainty estimates.”