Plan2Explore adapts to exploration tasks without fine-tuning

In a paper published this week on the preprint server arXiv.org, researchers affiliated with Google, Microsoft, Facebook, Carnegie Mellon, the University of Toronto, the University of Pennsylvania, and the University of California, Berkeley propose Plan2Explore, a self-supervised AI that leverages planning to tackle previously unknown goals. Without human supervision during training, the researchers claim, it outperforms prior methods, even in the absence of any task-specific interaction.

Self-supervised learning algorithms like Plan2Explore generate labels from data by exposing relationships between the data's parts, unlike supervised learning algorithms that train on expertly annotated data sets. They observe the world and interact with it only a little, learning mostly by observation in a task-independent way, much as an animal might. Turing Award winners Yoshua Bengio and Yann LeCun believe self-supervision is the key to human-level intelligence, and Plan2Explore puts it into practice -- it learns to complete new tasks without specifically training on those tasks.

Plan2Explore explores an environment and summarizes its experiences into a representation that enables the prediction of thousands of scenarios in parallel. (A scenario describes what would happen if the agent were to execute a sequence of actions -- for example, turning left into a hallway and then crossing the room.) With this world model in hand, Plan2Explore derives behaviors using Dreamer, a DeepMind-designed algorithm that plans ahead to select actions by anticipating their long-term outcomes. Then, Plan2Explore receives reward functions -- functions describing how the AI ought to behave -- to adapt to multiple tasks such as standing, walking, and running, using either zero or a few task-specific interactions.
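
To make the idea concrete, here is a minimal sketch of planning inside a world model and reusing that one model for two different reward functions. It is an illustration only, not Plan2Explore or Dreamer: the one-dimensional `world_model`, the three discrete actions, the exhaustive search in `plan`, and the two reward functions are all hypothetical stand-ins chosen to keep the example short.

```python
from itertools import product

ACTIONS = (-1, 0, 1)  # hypothetical actions: step left, stay, step right


def world_model(state, action):
    """Toy stand-in for a learned model: predicts the next state."""
    return state + action


def imagine(state, actions):
    """Roll out one scenario: the states an action sequence would visit."""
    states = []
    for action in actions:
        state = world_model(state, action)
        states.append(state)
    return states


def plan(state, reward_fn, horizon=4):
    """Score every action sequence inside the model; return the best one.

    Exhaustive search is an assumption made for brevity. It stands in for
    the learned, long-horizon planning the article attributes to Dreamer.
    """
    best_actions, best_return = None, float("-inf")
    for actions in product(ACTIONS, repeat=horizon):
        total = sum(reward_fn(s) for s in imagine(state, actions))
        if total > best_return:
            best_actions, best_return = actions, total
    return best_actions, best_return


# Two hypothetical tasks, supplied only after the model already exists.
def reach_right(state):
    return -abs(state - 3)


def reach_left(state):
    return -abs(state + 2)


right_plan, right_return = plan(0, reach_right)
left_plan, left_return = plan(0, reach_left)
print(right_plan, left_plan)
```

The same model yields a different plan for each reward function without any further interaction with the environment, which is the sense in which the article describes adaptation with zero task-specific interactions.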

To remain computationally efficient, Plan2Explore quantifies the uncertainty of its predictions. This encourages the system to seek out areas and trajectories within the environment where uncertainty is high, and Plan2Explore then trains on them to reduce that uncertainty. The process repeats, so Plan2Explore optimizes over trajectories that it predicted itself.
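
The article does not say how the uncertainty is measured. One common approach, assumed here purely for illustration, is to train several predictors and treat their disagreement as the uncertainty signal. In the sketch below, the linear ensemble members, their slopes, and the candidate state-action pairs are all hypothetical.

```python
from statistics import pvariance


def make_model(slope):
    """Hypothetical ensemble member: a linear one-step predictor."""
    return lambda state, action: slope * state + action


# The members agree at state 0 and diverge as |state| grows, mimicking
# models that were all trained on data collected near the origin.
ENSEMBLE = [make_model(slope) for slope in (0.9, 1.0, 1.1)]


def disagreement(state, action):
    """Variance of the members' next-state predictions."""
    return pvariance([model(state, action) for model in ENSEMBLE])


def most_uncertain(candidates):
    """Pick the (state, action) pair the ensemble disagrees on most."""
    return max(candidates, key=lambda pair: disagreement(*pair))


candidates = [(0.0, 1.0), (2.0, 1.0), (5.0, -1.0)]
target = most_uncertain(candidates)
print(target, disagreement(*target))
```

An explorer driven by this signal would head for the pair far from the origin, gather data there, and retrain, after which the members would agree more closely and the signal would point somewhere else.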

In experiments within the DeepMind Control Suite, a simulated performance benchmark for AI agents, the researchers say that Plan2Explore managed to accomplish goals without using goal-specific information -- that is, using only the self-supervised world model and no new interactions with the outside world. Plan2Explore also performed better than prior leading exploration strategies, sometimes being the only successful unsupervised method. And it demonstrated its world model was transferable to multiple tasks in the same environment; in one example, a cheetah-like agent ran backward, flipped forward, and flipped backward.

"Reinforcement learning allows solving complex tasks; however, the learning tends to be task-specific and the sample efficiency remains a challenge," wrote the coauthors. "By presenting a method that can learn effective behavior for many different tasks in a scalable and data-efficient manner, we hope this work constitutes a step toward building scalable real-world reinforcement learning systems."

Plan2Explore's code is available on GitHub.
