Few games are simpler in principle than capture the flag (excepting perhaps tag or kick the can). Two teams each have a marker located at their respective bases, and the objective is to capture the other team’s marker and return it safely to one’s own base. Easy peasy.

What’s easily understood by humans is not quite so quickly grasped by machines, though. Where capture the flag is concerned in the video game domain, non-player characters have traditionally been programmed with heuristics and rules affording limited freedom in choice.

But AI and machine learning promise to turn this paradigm on its head. In a paper published this week in the journal Science, roughly a year after the preprint, researchers at DeepMind, the London-based subsidiary of Google parent company Alphabet, describe a system capable not only of learning how to play capture the flag in Id Software’s Quake III Arena, but of devising entirely novel human-level team-based strategies.

“No one has told [the AI] how to play the game — only if they’ve beaten their opponent or not. The beauty of using [an] approach like this is that you never know what kind of behaviors will emerge as the agents learn,” said Max Jaderberg, a research scientist at DeepMind who also worked on AlphaStar, a machine learning system that recently bested a team of human professionals at StarCraft II. He further explained that the key technique at play is reinforcement learning, which employs rewards to drive software policies toward goals. In the DeepMind agents’ case, the reward is whether their team won or not.
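To make the win-or-lose idea concrete, here is a minimal sketch of reinforcement learning from outcome alone. It is not DeepMind's training code: the two strategies, their win chances, and the learning rate are invented for illustration, and the update rule is the textbook REINFORCE policy gradient rather than anything the paper specifies.

```python
import math
import random

def train_win_loss_policy(episodes=2000, learning_rate=0.1, seed=0):
    """Toy REINFORCE loop: the only feedback is win (+1) or loss (-1)."""
    rng = random.Random(seed)
    # Hypothetical win chances for two made-up strategies (assumption).
    win_chance = {"attack": 0.7, "wander": 0.3}
    logit = 0.0  # policy parameter: preference for "attack"
    for _ in range(episodes):
        p_attack = 1.0 / (1.0 + math.exp(-logit))
        action = "attack" if rng.random() < p_attack else "wander"
        # The agent is never told why it won, only whether it did.
        reward = 1.0 if rng.random() < win_chance[action] else -1.0
        # Gradient of the log-probability of the chosen action w.r.t. the logit.
        grad_log_prob = (1.0 - p_attack) if action == "attack" else -p_attack
        logit += learning_rate * reward * grad_log_prob
    return 1.0 / (1.0 + math.exp(-logit))

final_p_attack = train_win_loss_policy()
print(f"P(attack) after training: {final_p_attack:.2f}")
```

Nothing in the loop names the better strategy. The policy simply drifts toward whichever action tends to precede a win.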

“From a research perspective, it’s the novelty of the algorithmic approach that’s really exciting,” he said. “The specific way we trained our [AI] … is a good example of how to scale up and operationalize some classic evolutionary ideas.”

DeepMind’s cheekily dubbed For The Win (FTW) agents learn directly from on-screen pixels using a convolutional neural network, a collection of mathematical functions (neurons) arranged in layers modeled after the visual cortex. The ingested data is passed on to two recurrent long short-term memory (LSTM) networks, or networks capable of learning long-term dependencies. One operates on a fast timescale and the other on a slow timescale, and they’re coupled by a variational objective, a training objective under which they jointly learn to make predictions about the game world and output actions through an emulated game controller.
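The two-timescale idea can be sketched in a few lines. This is an illustrative toy, not the FTW architecture: plain tanh recurrent cells stand in for the LSTMs, the convolutional front end and the variational objective are left out entirely, and the class name, the `slow_period` parameter, and the choice to refresh the slow core once every few steps are all assumptions made for the example.

```python
import math
import random

class TinyRecurrentCell:
    """Simple tanh recurrent cell (a stand-in for an LSTM, by assumption)."""
    def __init__(self, input_size, hidden_size, rng):
        self.w_in = [[rng.uniform(-0.5, 0.5) for _ in range(input_size)]
                     for _ in range(hidden_size)]
        self.w_h = [[rng.uniform(-0.5, 0.5) for _ in range(hidden_size)]
                    for _ in range(hidden_size)]

    def step(self, x, h):
        return [math.tanh(sum(w * xi for w, xi in zip(row_in, x)) +
                          sum(w * hi for w, hi in zip(row_h, h)))
                for row_in, row_h in zip(self.w_in, self.w_h)]

def run_two_timescale(observations, slow_period=4, hidden_size=3, seed=0):
    """Run a fast core every step and a slow core every `slow_period` steps."""
    rng = random.Random(seed)
    obs_size = len(observations[0])
    slow = TinyRecurrentCell(obs_size + hidden_size, hidden_size, rng)
    fast = TinyRecurrentCell(obs_size + hidden_size, hidden_size, rng)
    h_slow = [0.0] * hidden_size
    h_fast = [0.0] * hidden_size
    slow_updates = 0
    trace = []
    for t, obs in enumerate(observations):
        if t % slow_period == 0:
            # The slow core sees the observation plus the fast core's state.
            h_slow = slow.step(obs + h_fast, h_slow)
            slow_updates += 1
        # The fast core is conditioned on the (possibly stale) slow state.
        h_fast = fast.step(obs + h_slow, h_fast)
        trace.append((list(h_fast), list(h_slow)))
    return trace, slow_updates

frames = [[0.1 * t, 1.0] for t in range(8)]  # made-up feature vectors
trace, slow_updates = run_two_timescale(frames)
print(f"fast updates: {len(trace)}, slow updates: {slow_updates}")
```

In this sketch the slow state stays fixed between its updates, so it can only carry longer-lived context, while the fast state reacts to every frame.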

The FTW agents were trained in a population of 30, which provided them with a range of teammates and opponents to play with, and stages were selected randomly to prevent the agents from memorizing layouts. Each agent learned its own reward signal, enabling it to generate its own internal goals (like capturing the flag). The training used a two-tier process: one tier optimized the agents’ internal rewards, and the other applied reinforcement learning on those rewards to suss out the overriding policies.
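A rough sketch of the outer tier of such a population scheme follows. It is a toy under loud assumptions: each agent is reduced to a single internal reward weight, the inner reinforcement learning tier is abstracted away into a made-up rule that agents whose weight is closer to a hypothetical `TARGET_WEIGHT` win more often, and the replace-the-worst-with-a-perturbed-copy-of-the-best step is a generic evolutionary update rather than the paper's exact procedure.

```python
import math
import random

TARGET_WEIGHT = 0.7  # hypothetical "ideal" internal reward weight (assumption)

def win_probability(weight_a, weight_b):
    """Chance that agent A beats agent B in this toy model."""
    skill_gap = abs(weight_b - TARGET_WEIGHT) - abs(weight_a - TARGET_WEIGHT)
    return 1.0 / (1.0 + math.exp(-10.0 * skill_gap))

def mean_error(weights):
    """Average distance of the population from the hypothetical ideal."""
    return sum(abs(w - TARGET_WEIGHT) for w in weights) / len(weights)

def evolve_population(pop_size=30, rounds=200, matches_per_round=150, seed=0):
    rng = random.Random(seed)
    weights = [rng.uniform(0.0, 1.0) for _ in range(pop_size)]
    initial = list(weights)
    for _ in range(rounds):
        wins = [0] * pop_size
        games = [0] * pop_size
        for _ in range(matches_per_round):
            # Random matchmaking gives every agent varied opponents.
            a, b = rng.sample(range(pop_size), 2)
            p_a = win_probability(weights[a], weights[b])
            winner = a if rng.random() < p_a else b
            wins[winner] += 1
            games[a] += 1
            games[b] += 1
        rates = [w / g if g else 0.5 for w, g in zip(wins, games)]
        worst = min(range(pop_size), key=rates.__getitem__)
        best = max(range(pop_size), key=rates.__getitem__)
        # Outer tier: the weakest agent inherits a mutated copy of the
        # strongest agent's internal reward weight.
        weights[worst] = weights[best] + rng.gauss(0.0, 0.05)
    return initial, weights

initial_weights, final_weights = evolve_population()
print(f"mean error before: {mean_error(initial_weights):.3f}")
print(f"mean error after:  {mean_error(final_weights):.3f}")
```

No agent is ever told the target value. Selection pressure from match results alone pulls the population's internal rewards toward it.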

In all, agents individually played around 450,000 games of capture the flag, the equivalent of roughly four years of experience.
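As a quick sanity check on those two figures, the implied average game length can be worked out directly. The only assumption is that "four years" means four years of continuous play.

```python
MINUTES_PER_YEAR = 365.25 * 24 * 60  # 525,960 minutes

games_played = 450_000
years_of_experience = 4  # taken as continuous play (assumption)

implied_minutes_per_game = years_of_experience * MINUTES_PER_YEAR / games_played
print(round(implied_minutes_per_game, 1))  # about 4.7 minutes per game
```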

DeepMind Quake III Arena

Above: A diagram illustrating the activations in DeepMind’s AI system.

Image Credit: DeepMind

“This is a really, really powerful learning paradigm,” said Wojciech Marian Czarnecki, a research scientist at DeepMind who also contributed to AlphaStar. “You’re actually boosting performance — it looks like the multiagent aspects are actually making our life much easier in terms of succeeding in our research.”

The fully trained FTW agents, which run on commodity PC hardware, employed strategies generalizable across maps, team rosters, and team sizes. They learned humanlike behaviors such as following teammates, camping in the opponent’s base, and defending their own base from waves of attackers, and they shed less advantageous behaviors (like closely following teammates around the map) as training progressed.

So how’d the agents fare, ultimately? In a tournament involving 40 human players in which humans and agents were randomly matched in games (both as opponents and teammates), the FTW agents were more proficient than the baseline methods. In fact, they exceeded the win rate of human players substantially, with an Elo rating (which corresponds to the probability of winning) of 1,600 compared with “strong” human players’ 1,300 and average human players’ 1,050.
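To see how an Elo gap translates into a win probability, the standard Elo expected-score formula can be applied to the ratings quoted above. This is a sketch under an assumption: it uses the conventional base-10, 400-point scale for a one-on-one contest, which may not match exactly how the researchers computed ratings for team games.

```python
def expected_score(rating_a, rating_b):
    """Standard Elo expected score of A against B (400-point scale)."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))

agent, strong_human, average_human = 1600, 1300, 1050

vs_strong = expected_score(agent, strong_human)    # about 0.85
vs_average = expected_score(agent, average_human)  # about 0.96
print(f"agent vs. strong human:  {vs_strong:.2f}")
print(f"agent vs. average human: {vs_average:.2f}")
```

On that scale, a 300-point lead corresponds to winning roughly 85% of the time, and a 550-point lead to roughly 96%.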

The agents had fast reaction times, unsurprisingly, which gave them a slight advantage in initial experiments. But even when their accuracy was reduced and their reactions were slowed through an inbuilt quarter-of-a-second (257-millisecond) delay, they still outperformed their human counterparts, with strong human players and intermediate players winning only 21% and 12% of the time, respectively.

Furthermore, when the researchers set the agents loose on other Quake III Arena game types following the paper’s publication, including professionally played maps and multiplayer modes with more gadgets and pickups (like Harvester on the Future Crossings map and One Flag Capture the Flag on the Ironwood map), the agents began to challenge the skills of human researchers in test matches. And when the researchers examined the activation patterns of the agents’ neural networks — i.e., the functions of neurons responsible for defining output data given input data — they found clusters representing rooms, the status of the flags, the visibility of teammates and opponents, the presence or absence of agents in the opponents’ base or team base, and other “meaningful aspects” of gameplay.

The trained agents even contained neurons that coded directly for particular situations, like when the agent’s flag is taken or when an agent’s teammate is holding a flag. “I think one of the things to note is that these ideas, these multiagent domains, are exceptionally powerful, and this paper shows us that,” said Jaderberg. “I think that’s what we’re learning better and better over the last couple of years — how to construct the problem of reinforcement learning. Reinforcement learning really shines in new situations.”

Thore Graepel, a professor of computer science at University College London and a scientist at DeepMind, says that the work highlights the potential of multiagent training to advance the development of AI. It might inform, for example, research in human-machine interaction and systems that complement one another or work together.

“Our results demonstrate that multiagent reinforcement learning can successfully tackle a complex game to the point that human players even think computer players are better teammates. They also provide a fascinating in-depth analysis of how the trained agents behave, work together, and represent their environment,” he said. “What makes these results so exciting is that these agents perceive their environment from a first-person perspective, just as a human player would. In order to learn how to play tactically and collaborate with their teammates, these agents must rely on feedback from the game outcomes — without any teacher or coach showing them what to do.”