In a study published on the preprint server Arxiv.org, DeepMind researchers describe a reinforcement learning algorithm-generating technique that discovers what to predict and how to learn it by interacting with environments. They claim the generated algorithms perform well on a range of challenging Atari video games, achieving “non-trivial” performance indicative of the technique’s generalizability.

Reinforcement learning algorithms — algorithms that enable software agents to learn in environments by trial and error using feedback — update an agent’s parameters according to one of several rules. These rules are usually discovered through years of research, and automating their discovery from data could lead to more efficient algorithms, or algorithms better adapted to specific environments.

DeepMind’s solution is a meta-learning framework that jointly discovers what a particular agent should predict and how to use the predictions for policy improvement. (In reinforcement learning, a “policy” defines the learning agent’s way of behaving at a given time.) Their architecture — learned policy gradient (LGP) — allows the update rule (that is, the meta-learner) to decide what the agent’s outputs should be predicting while the framework discovers rules via multiple learning agents, each of which interacts with a different environment.

In experiments, the researchers evaluated the LPG directly on complex Atari games including Tutankham, Breakout, and Yars’ Revenge. They found that it generalized to the games “reasonably well” when compared with existing algorithms, despite the fact the training environments consisted of environments with basic tasks much simpler than Atari games. Moreover, the agents trained with the LPG managed to achieve “superhuman” performance on 14 games without relying on hand-designed reinforcement learning components.

The coauthors noted that LPG still lags behind some advanced reinforcement learning algorithms. But during the experiments, its generalization performance improved quickly as the number of training environments grew, suggesting it might be feasible to discover a general-purpose reinforcement learning algorithm once a larger set of environments are available for meta-training.

“The proposed approach has the potential to dramatically accelerate the process of discovering new reinforcement learning algorithms by automating the process of discovery in a data-driven way. If the proposed research direction succeeds, this could shift the research paradigm from manually developing reinforcement learning algorithms to building a proper set of environments so that the resulting algorithm is efficient,” the researchers wrote. “Additionally, the proposed approach may also serve as a tool to assist reinforcement learning researchers in developing and improving their hand-designed algorithms. In this case, the proposed approach can be used to provide insights about what a good update rule looks like depending on the architecture that researchers provide as input, which could speed up the manual discovery of reinforcement learning algorithms.”