OpenAI's Procgen Benchmark prevents AI model overfitting

When training machine learning models, there's always a risk of overfitting, or fitting too closely to a particular set of data. Indeed, popular machine learning benchmarks like the Arcade Learning Environment may encourage overfitting, because they place little emphasis on generalization.

That's why OpenAI -- the San Francisco-based research firm cofounded by CTO Greg Brockman, chief scientist Ilya Sutskever, and others -- today released the Procgen Benchmark, a set of 16 procedurally generated environments (CoinRun, StarPilot, CaveFlyer, Dodgeball, FruitBot, Chaser, Miner, Jumper, Leaper, Maze, BigFish, Heist, Climber, Plunder, Ninja, and BossFight) that measure how quickly a model learns generalizable skills. It builds atop the startup's CoinRun toolset, which used procedural generation to construct sets of training and test levels.
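To make the idea of procedurally generated training and test levels concrete, here is a minimal toy sketch. It is not OpenAI's generator: the `generate_level` function, the grid layout, and the seed ranges are all hypothetical, chosen only to show how a seed can deterministically produce a level and how disjoint seed ranges yield separate training and test sets.

```python
import random


def generate_level(seed, width=8, height=8, wall_prob=0.2):
    """Build a toy grid level deterministically from an integer seed.

    Hypothetical illustration only; real Procgen levels are far richer.
    """
    rng = random.Random(seed)  # private RNG: the same seed always yields the same level
    grid = [
        ["#" if rng.random() < wall_prob else "." for _ in range(width)]
        for _ in range(height)
    ]
    # Place the agent (A) and a coin (C) on two distinct cells.
    cells = [(r, c) for r in range(height) for c in range(width)]
    agent, coin = rng.sample(cells, 2)
    grid[agent[0]][agent[1]] = "A"
    grid[coin[0]][coin[1]] = "C"
    return ["".join(row) for row in grid]


# Disjoint seed ranges: the agent never sees test levels during training.
train_seeds = range(0, 200)
test_seeds = range(200, 300)
train_levels = [generate_level(s) for s in train_seeds]
test_levels = [generate_level(s) for s in test_seeds]
```

Because levels come from seeds rather than hand-built files, the training set can be made as large or as small as an experiment requires.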

"We want the best of both worlds: a benchmark comprised of many diverse environments, each of which fundamentally requires generalization," wrote OpenAI in a blog post. "To fulfill this need, we have created Procgen Benchmark ... [which strives] for all of the following: experimental convenience, high diversity within environments, and high diversity across environments ... CoinRun now serves as the inaugural environment in Procgen Benchmark, contributing its diversity to a greater whole."

According to OpenAI, Procgen environments were designed with a large amount of freedom (subject to basic design constraints) so as to present AI-driven agents with "meaningful" generalization challenges. They were also calibrated so that baseline agents make significant progress after training for 200 million time steps, and optimized to run at thousands of steps per second on as little as a single processor core.

Additionally, Procgen environments support two "well-calibrated" difficulty settings: easy and hard. (The former targets users with limited access to compute power, as it requires roughly an eighth of the resources to train.) And they mimic the style of a number of Atari and Gym Retro games, in keeping with precedent.
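The roughly one-eighth figure can be turned into a concrete budget. The sketch below assumes the 200 million time steps mentioned earlier apply to the hard setting, which the article does not state explicitly. The helper names `training_budget`, `env_id`, and `make_env` are hypothetical. The `gym.make` call follows the usage shown in the Procgen README and needs the `gym` and `procgen` packages installed, so it is kept inside a function and not run here.

```python
HARD_TIMESTEPS = 200_000_000  # assumption: the quoted budget is for the hard setting
EASY_FRACTION = 1 / 8         # "roughly an eighth of the resources"


def training_budget(mode):
    """Return an approximate time-step budget for a difficulty setting."""
    if mode == "hard":
        return HARD_TIMESTEPS
    if mode == "easy":
        return int(HARD_TIMESTEPS * EASY_FRACTION)
    raise ValueError(f"unknown difficulty setting: {mode!r}")


def env_id(game):
    """Build the Gym environment id for a Procgen game, e.g. 'coinrun'."""
    return f"procgen:procgen-{game}-v0"


def make_env(game, mode):
    """Create a Procgen environment; requires gym and procgen to be installed."""
    import gym  # imported lazily so the budget helpers work without it

    return gym.make(env_id(game), distribution_mode=mode)


print(training_budget("easy"))  # 25000000
```

Under that assumption, the easy setting works out to about 25 million time steps per environment.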

According to OpenAI, AI model performance generally improves as the set of training levels grows. "We believe this increase in training performance comes from an implicit curriculum provided by a diverse set of levels," the blog authors explain. "A larger training set can improve training performance if the agent learns to generalize even across levels in the training set."
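An experiment of this kind could be laid out as a sweep over training-set sizes. The sketch below only builds the settings for such a sweep; it does not train anything. The `num_levels` and `start_level` options are documented in the Procgen README, where `num_levels=0` means an unlimited supply of levels. The specific sizes and the `env_kwargs` and `experiment_plan` helpers are illustrative assumptions, not OpenAI's protocol.

```python
def env_kwargs(num_levels, start_level=0, distribution_mode="hard"):
    """Keyword arguments one might pass to gym.make for a Procgen game."""
    return {
        "num_levels": num_levels,      # 0 = unlimited levels
        "start_level": start_level,    # first seed of the level range
        "distribution_mode": distribution_mode,
    }


# Illustrative training-set sizes (assumed, not taken from the article).
TRAIN_SET_SIZES = [100, 1_000, 10_000, 100_000]


def experiment_plan(sizes):
    """Pair each finite training set with evaluation on the full distribution."""
    return [
        {"train": env_kwargs(num_levels=n), "test": env_kwargs(num_levels=0)}
        for n in sizes
    ]


plan = experiment_plan(TRAIN_SET_SIZES)
```

Each entry trains on a fixed number of levels and evaluates on the unrestricted distribution, so scores can be compared across training-set sizes.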

OpenAI leaves more complex settings to future work, which it expects will inform more capable and efficient AI models. "[The] vast gap between training and test performance is worth highlighting. It reveals a crucial hidden flaw in training on environments that follow a fixed sequence of levels," wrote OpenAI.

OpenAI previously released Neural MMO, a "massively multiagent" virtual training ground that plops agents in the middle of an RPG-like world, and Gym, a proving ground for reinforcement learning algorithms (which train machines to do things through trial and error). More recently, it made available Safety Gym, a suite of tools for developing AI that respects safety constraints while training, and for comparing the "safety" of algorithms and the extent to which those algorithms avoid mistakes while learning.