Artificial intelligence (AI) can generate synthetic scans of brain cancer, simultaneously translate between languages, and teach robots to manipulate objects with humanlike dexterity. And as new research from OpenAI reveals, it’s pretty darn good at playing video games, too.
On Tuesday, OpenAI (a nonprofit, San Francisco-based AI research company backed by Elon Musk, Reid Hoffman, and Peter Thiel, among other tech luminaries) detailed in a research paper an AI system that can best humans at the retro platformer Montezuma’s Revenge. The top-performing iteration found 22 of the 24 rooms in the first level, and occasionally discovered all 24.
The paper follows news in June of an OpenAI-developed bot that can defeat skilled teams in Valve’s Dota 2.
As OpenAI noted in an accompanying blog post, Montezuma’s Revenge is notoriously difficult for machine learning algorithms to master. In 2015, it was the only Atari 2600 title to foil Google subsidiary DeepMind’s headline-grabbing Deep Q-Learning network, which scored 0 percent of the average human score (4.7K).
“Simple exploration strategies are highly unlikely to gather any rewards, or see more than a few of the 24 rooms in the level,” OpenAI wrote. “Since then, advances in Montezuma’s Revenge have been seen by many as synonymous with advances in exploration.”
OpenAI calls its method Random Network Distillation (RND) and says it’s designed to work with any reinforcement learning algorithm, i.e., any method that uses a system of rewards and punishments to drive AI agents toward specific goals.
Traditionally, curiosity-driven agents learn a next-state predictor model from their experience and use the prediction error as an intrinsic reward. Unlike prior methods, RND bases its bonus reward on how well the agent can predict the output of a fixed, randomly initialized neural network applied to the next state. States the agent has visited often become easy to predict and earn little bonus, while unfamiliar states earn more.
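To make the idea concrete, here is a minimal toy sketch of an RND-style bonus in Python with NumPy. It is not OpenAI’s implementation. The one-hot "room" observations, the layer sizes, the linear predictor, the learning rate, and every function name are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
OBS_DIM, FEAT_DIM = 24, 16  # hypothetical sizes: one-hot room id, feature width

# Fixed, randomly initialized target network. It is never trained.
W_target = rng.normal(size=(FEAT_DIM, OBS_DIM))

def target(obs):
    return np.tanh(W_target @ obs)

# Predictor network (linear here for simplicity), trained to match the target.
W_pred = rng.normal(size=(FEAT_DIM, OBS_DIM)) * 0.1

def intrinsic_reward(obs):
    """Exploration bonus: squared error between predictor and fixed target."""
    err = W_pred @ obs - target(obs)
    return float(err @ err)

def train_predictor(obs, lr=0.5):
    """One gradient step on 0.5 * ||W_pred @ obs - target(obs)||^2."""
    global W_pred
    err = W_pred @ obs - target(obs)
    W_pred -= lr * np.outer(err, obs)

def one_hot(i):
    v = np.zeros(OBS_DIM)
    v[i] = 1.0
    return v

familiar, novel = one_hot(0), one_hot(1)
bonus_before = intrinsic_reward(familiar)
for _ in range(20):  # the agent keeps seeing room 0
    train_predictor(familiar)
bonus_familiar = intrinsic_reward(familiar)
bonus_novel = intrinsic_reward(novel)
print(bonus_before, bonus_familiar, bonus_novel)
```

In this toy setup the bonus for the repeatedly visited room shrinks toward zero, while the room the agent has never seen keeps its bonus. That gap is the signal that nudges an agent toward unexplored territory.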
In the course of a run, the agents started out playing Montezuma’s Revenge completely randomly and improved their strategy through trial and error. Thanks to the RND component, they were incentivized to explore areas of the game map they might not have otherwise, managing to achieve the game’s objective even when it wasn’t explicitly communicated.
“Curiosity drives the agent to discover new rooms and find ways of increasing the in-game score, and this extrinsic reward drives it to revisit those rooms later in the training,” OpenAI explained. “Curiosity gives us an easier way to teach agents to interact with any environment, rather than via an extensively engineered task-specific reward function that we hope corresponds to solving a task. An agent using a generic reward function not specific to the particulars of an environment can acquire a basic level of competency in a wide range of environments, resulting in the agent’s ability to determine what [useful] behaviors are even in the absence of carefully engineered rewards.”
RND also addresses another common issue in curiosity-driven reinforcement learning schemes: the so-called noisy TV problem, in which an AI agent can become stuck looking for patterns in random data (like static on a TV).
“Like a gambler at a slot machine attracted to chance outcomes, the agent sometimes gets trapped by its curiosity,” OpenAI wrote. “The agent finds a source of randomness in the environment and keeps observing it, always experiencing a high intrinsic reward for such transitions.”
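A small simulation can illustrate why the gambler gets hooked, and why a fixed random target may help. The sketch below is a toy under stated assumptions, not OpenAI’s experiment: the "TV" shows one of two one-hot frames at random, the next-state predictor is a single learned guess, and the RND-style predictor is linear. All names and constants are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 4
channels = np.eye(DIM)[:2]  # two equally likely "static" frames

# (a) Next-state prediction: one learned guess for the frame that comes next.
next_guess = np.zeros(DIM)

# (b) RND-style bonus: fixed random target, trainable linear predictor,
#     both applied to the frame the agent actually observes.
W_target = rng.normal(size=(DIM, DIM))
W_pred = np.zeros((DIM, DIM))

dyn_errors, rnd_errors = [], []
for step in range(2000):
    frame = channels[rng.integers(2)]  # unpredictable transition

    # The next frame is random, so the best guess is the average frame
    # and the prediction error never vanishes.
    dyn_err = next_guess - frame
    dyn_errors.append(float(dyn_err @ dyn_err))
    next_guess -= 0.05 * dyn_err

    # The target output is a deterministic function of the frame, so the
    # predictor can learn it exactly once each frame has been seen enough.
    rnd_err = W_pred @ frame - W_target @ frame
    rnd_errors.append(float(rnd_err @ rnd_err))
    W_pred -= 0.05 * np.outer(rnd_err, frame)

late_dyn = float(np.mean(dyn_errors[-200:]))
late_rnd = float(np.mean(rnd_errors[-200:]))
print(late_dyn, late_rnd)
```

In this toy, the next-state error stays high no matter how long the agent watches, which is the perpetual reward that traps it. The RND-style error fades once both frames have been seen often enough, so the screen stops being interesting.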
So how’d it perform? On average, OpenAI’s agents scored 10K over nine runs with a best mean return of 14.5K. A longer-running test yielded a run that achieved 17.5K, corresponding to passing the first level and finding all 24 rooms.
It wasn’t just Montezuma’s Revenge they mastered. When set loose on Super Mario, the agents discovered 11 levels, found secret rooms, and defeated bosses. They learned how to beat Breakout after a few hours of training. And when tasked with volleying a ball in Pong with a human player, they tried to prolong the game rather than win.
OpenAI has its fingers in a number of AI pies besides gaming.
Last year, it developed software that produces high-quality datasets for neural networks by randomizing the colors, lighting conditions, textures, and camera settings in simulated scenes. (Researchers used it to teach a mechanized arm to remove a can of Spam from a table of groceries.) More recently, in February, it released Hindsight Experience Replay (HER), an open source algorithm that effectively helps robots to learn from failure. And in July, it unveiled a system that directs robot hands in grasping and manipulating objects with state-of-the-art precision.