Machines learning to play games by watching humans might sound like the plot of a science fiction novel, but that’s exactly what researchers at OpenAI — a nonprofit, San Francisco-based AI research company backed by Elon Musk, Reid Hoffman, and Peter Thiel, among other tech luminaries — and Google subsidiary DeepMind claim to have accomplished.
In a paper published on the preprint server Arxiv.org (“Reward learning from human preferences and demonstrations in Atari”), they describe an AI system that combines two approaches to learning from human feedback: expert demonstrations and trajectory preferences. Their deep neural network — which, like other neural networks, consists of mathematical functions loosely modeled on neurons in the brain — achieved superhuman performance on two out of the nine Atari games tested (Pong and Enduro) and beat baseline models in seven.
The research was submitted to the Conference on Neural Information Processing Systems (NIPS 2018), which is scheduled to take place in Montreal, Canada, during the first week of December.
“To solve complex real-world problems with reinforcement learning, we cannot rely on manually specified reward functions,” the team wrote. “Instead, we can have humans communicate an objective to the agent directly.”
It’s a technique that’s been referred to in prior research as “inverse reinforcement learning,” and it holds promise for tasks involving poorly defined objectives that tend to trip up artificially intelligent (AI) systems. As the paper’s authors noted, reinforcement learning — which uses a system of rewards (or punishments) to drive AI agents to achieve specific goals — isn’t of much use if the goals in question lack feedback mechanisms.
Game-playing agents created by the researchers’ AI model didn’t merely mimic human behavior. If they had, they wouldn’t have been particularly scalable, because they would have required a human expert to teach them how to perform specific tasks and never would have been able to achieve “significantly” better performance than said experts.
The researchers’ system combined several forms of feedback, including imitation learning from expert demonstrations and a reward model that used trajectory preferences. Basically, it didn’t assume a directly available reward, such as an increase in score or an in-game bonus. Instead, relying on feedback from a human in the loop, it attempted to approximate the intended behavior as closely as possible by (1) imitating it from demonstrations and (2) maximizing the inferred reward function.
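To make the two-part objective concrete, here is a minimal sketch of how an imitation term and an inferred-reward term might be combined in a single Q-learning loss. The article doesn't spell out the actual loss, so this borrows the large-margin formulation familiar from deep Q-learning from demonstrations as an assumption; the function names, the `imitation_weight` parameter, and the toy numbers are all illustrative.

```python
def td_error(q_values, next_q_values, action, predicted_reward, gamma=0.99):
    """One-step Q-learning error that uses the reward model's prediction
    (predicted_reward) in place of the game score."""
    target = predicted_reward + gamma * max(next_q_values)
    return target - q_values[action]


def margin_imitation_loss(q_values, expert_action, margin=1.0):
    """Large-margin loss: zero only when the expert's action beats every
    other action's Q-value by at least `margin`."""
    best = max(
        q + (0.0 if a == expert_action else margin)
        for a, q in enumerate(q_values)
    )
    return best - q_values[expert_action]


def combined_loss(q_values, next_q_values, action, predicted_reward,
                  is_demo=False, imitation_weight=1.0):
    """Squared TD error on the inferred reward, plus the imitation term
    when the transition comes from a human demonstration."""
    loss = td_error(q_values, next_q_values, action, predicted_reward) ** 2
    if is_demo:
        # In a demonstration, the action taken is the expert's action.
        loss += imitation_weight * margin_imitation_loss(q_values, action)
    return loss


# Toy transition: three actions, with the expert having picked action 0.
q = [1.0, 2.0, 0.5]
next_q = [0.0, 0.0, 0.0]
agent_only = combined_loss(q, next_q, action=0, predicted_reward=0.5)
with_demo = combined_loss(q, next_q, action=0, predicted_reward=0.5, is_demo=True)
```

In this toy case the demonstration transition carries a larger loss than the same transition from the agent's own experience, because the network currently rates action 1 above the expert's choice.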
The model consisted of two parts: a deep Q-Learning network, which DeepMind tapped in prior research to achieve superhuman performance in Atari 2600 games, and a reward model, a convolutional neural network trained on labels supplied by an annotator — either a human or a synthetic system — during task training.
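How might a reward model learn from an annotator's labels? A common formulation in preference-based reward learning treats each label as a comparison between two short clips and fits the model so that the preferred clip gets the higher predicted return. The sketch below assumes that setup. It swaps the convolutional network for a linear model over made-up two-number features, and the "annotator" is a synthetic rule, so the names and data here are hypothetical rather than taken from the paper.

```python
import math
import random


def feature_sums(segment, n_features):
    """Sum each feature over the steps of a clip."""
    return [sum(step[i] for step in segment) for i in range(n_features)]


def segment_return(weights, segment):
    """Predicted return of a clip under a linear per-step reward."""
    sums = feature_sums(segment, len(weights))
    return sum(w * s for w, s in zip(weights, sums))


def preference_prob(weights, seg_a, seg_b):
    """Probability that seg_a is preferred over seg_b (logistic in the
    difference of predicted returns)."""
    diff = segment_return(weights, seg_a) - segment_return(weights, seg_b)
    diff = max(-30.0, min(30.0, diff))  # clamp to keep exp() from overflowing
    return 1.0 / (1.0 + math.exp(-diff))


def train_reward_model(comparisons, n_features, lr=0.1, epochs=200):
    """Fit reward weights by gradient descent on the cross-entropy between
    predicted and labeled preferences. label is 1.0 if seg_a was preferred."""
    weights = [0.0] * n_features
    for _ in range(epochs):
        for seg_a, seg_b, label in comparisons:
            p = preference_prob(weights, seg_a, seg_b)
            phi_a = feature_sums(seg_a, n_features)
            phi_b = feature_sums(seg_b, n_features)
            for i in range(n_features):
                weights[i] -= lr * (p - label) * (phi_a[i] - phi_b[i])
    return weights


# Synthetic annotator (hypothetical): likes feature 0, dislikes feature 1.
rng = random.Random(0)
true_w = [1.0, -1.0]


def random_segment(length=5):
    return [[rng.random(), rng.random()] for _ in range(length)]


comparisons = []
for _ in range(50):
    a, b = random_segment(), random_segment()
    label = 1.0 if segment_return(true_w, a) > segment_return(true_w, b) else 0.0
    comparisons.append((a, b, label))

learned = train_reward_model(comparisons, n_features=2)
```

The learned weights only need to rank clips the way the annotator does. Their absolute scale is not pinned down by comparisons alone.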
Agents learned over time both from the demonstrations and from experience. All the while, human experts prevented them from exploiting unexpected sources of reward that could harm performance, a phenomenon known as reward hacking.
In testing, the researchers ran agents trained with the AI model in the Arcade Learning Environment, an open source framework for designing AI agents that can play Atari 2600 games. Atari games, the researchers wrote, have the advantage of being “among the most diverse environments” for reinforcement learning and provide “well-specified” reward functions.
After 50 million steps and a full schedule of 6,800 labels, the agents trained with the researchers’ system outperformed imitation learning baselines in every game tested (including Beamrider, Breakout, Enduro, Pong, Q*bert, and Seaquest) except Private Eye. Human demonstrations greatly benefited Hero, Montezuma’s Revenge, and Private Eye, the researchers found, and typically halved the amount of human time required to achieve the same level of performance.
The research follows on the heels of an AI system — also the work of OpenAI scientists — that can best humans at Montezuma’s Revenge. (Most of that model’s performance improvements came from random network distillation, which introduced a bonus reward that’s based on predicting the output of a fixed and randomly initialized neural network on the next state.) When set loose on Super Mario, agents trained by the system discovered 11 levels, found secret rooms, and defeated bosses. And when tasked with volleying a ball in Pong with a human player, they tried to prolong the game rather than win.
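For readers curious about the random network distillation bonus mentioned above, the idea can be sketched in a few lines: a fixed, randomly initialized network produces an output for each state, a second network is trained to predict that output, and the prediction error serves as the bonus. States the agent has seen often become easy to predict and earn little; unfamiliar ones earn more. The toy version below is a simplification under stated assumptions: the class name `RNDBonus`, the tiny one-layer target, the linear predictor that starts at zero, and the three-number "states" are all illustrative.

```python
import math
import random


class RNDBonus:
    """Toy exploration bonus based on predicting a fixed random network."""

    def __init__(self, n_in, n_out=4, lr=0.1, seed=0):
        rng = random.Random(seed)
        # Target network: random weights that are never updated.
        self.target = [[rng.uniform(-1.0, 1.0) for _ in range(n_in)]
                       for _ in range(n_out)]
        # Predictor: linear model trained to match the target's output.
        self.predictor = [[0.0] * n_in for _ in range(n_out)]
        self.lr = lr

    def _target_out(self, state):
        return [math.tanh(sum(w * x for w, x in zip(row, state)))
                for row in self.target]

    def _predictor_out(self, state):
        return [sum(w * x for w, x in zip(row, state))
                for row in self.predictor]

    def bonus(self, state):
        """Squared prediction error: large for unfamiliar states."""
        return sum((p - t) ** 2 for p, t in
                   zip(self._predictor_out(state), self._target_out(state)))

    def update(self, state):
        """One gradient step pulling the predictor toward the target."""
        preds = self._predictor_out(state)
        targets = self._target_out(state)
        for j, (p, t) in enumerate(zip(preds, targets)):
            for i, x in enumerate(state):
                self.predictor[j][i] -= self.lr * (p - t) * x


rnd = RNDBonus(n_in=3)
familiar = [1.0, 0.0, 0.5]
novel = [0.0, 1.0, 0.0]

bonus_before = rnd.bonus(familiar)
for _ in range(50):  # the agent keeps revisiting the same state
    rnd.update(familiar)
bonus_after = rnd.bonus(familiar)
bonus_novel = rnd.bonus(novel)
```

After repeated visits, the bonus for the familiar state shrinks toward zero while the unvisited state keeps its full bonus, which is what nudges an agent toward places it hasn't been.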
It also comes after news in June of an OpenAI-developed bot that can defeat skilled teams in Valve’s Dota 2.