Two years ago, Nvidia researchers detailed AI that could generate visuals and combine them with a game engine to create an interactive, navigable environment. In something of a follow-up to that work, scientists at the company, the Vector Institute, MIT, and the University of Toronto this week published a paper describing GameGAN, a system that can synthesize a functional version of a game without an underlying engine.
While game generation on its face might not seem like the most practical application of AI, algorithms like GameGAN could one day be used to produce simulators for training robotic systems. Before it’s deployed to the real world, robot-controlling AI typically undergoes extensive testing in simulated environments, which comprise procedural models that synthesize scenes and behavior trees specifying in-simulation agents’ behaviors. Writing these models and trees requires both time and highly skilled domain experts, which translates to an uptick in expenditures by companies looking to transfer models to real-world robots.
It bears mentioning that GameGAN isn’t the first system designed to tackle game generation. A recent paper coauthored by Google Brain researchers describes an algorithm that uses video prediction techniques to train game-playing AI within learned models of Atari games. A Georgia Tech study proposes an algorithm that absorbs game footage and probabilistically maps relationships between in-game objects and how they change. Facebook’s system can extract controllable characters from real-world videos of tennis players, fencing instructors, and more. And systems like those proposed by researchers at the University of California, Santa Barbara and the Politecnico di Milano in Italy draw on knowledge of existing stages to create new stages in games like Doom and Super Mario Bros.
But GameGAN uniquely frames game creation as an image generation problem. Given sequences of frames from a game and the corresponding actions that agents (i.e., players) took within the game, the system visually imitates the game using a trained AI model. Concretely, GameGAN ingests recorded gameplay footage and keyboard actions during training and aims to predict the next frame by conditioning on the action (for example, a button pressed by a player). It learns from image and action pairs directly without having access to the underlying logic or engine, leveraging a memory module that encourages the system to build a map of the game environment. A decoder learns to disentangle static and dynamic components within frames, which makes GameGAN’s behavior more interpretable and allows existing games to be modified on the fly by swapping out various assets.
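To make the “image and action pairs” idea concrete, here is a minimal sketch of how a recorded episode might be sliced into next-frame prediction examples. It is not the researchers’ data pipeline; the names `Frame`, `Sample`, and `make_training_pairs` are hypothetical, and the frames are tiny nested lists standing in for real images.

```python
from typing import List, Tuple

Frame = List[List[int]]  # toy stand-in for an image: rows of pixel values
Sample = Tuple[Tuple[Frame, str], Frame]  # ((frame_t, action_t), frame_t_plus_1)


def make_training_pairs(frames: List[Frame], actions: List[str]) -> List[Sample]:
    """Pair each (frame, action) input with the frame that followed it."""
    if len(frames) != len(actions) + 1:
        raise ValueError("expected exactly one more frame than actions")
    return [((frames[t], actions[t]), frames[t + 1]) for t in range(len(actions))]


# A three-frame episode in which a single bright pixel moves right, then down.
episode_frames = [
    [[1, 0], [0, 0]],
    [[0, 1], [0, 0]],
    [[0, 0], [0, 1]],
]
episode_actions = ["RIGHT", "DOWN"]

pairs = make_training_pairs(episode_frames, episode_actions)
for (frame, action), next_frame in pairs:
    print(action, frame, "->", next_frame)
```

A model trained on such pairs never sees the game’s rules. It only sees which frame followed which frame-and-action input, which is the sense in which game creation becomes an image generation problem.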
Accomplishing this required the researchers to overcome formidable design challenges, like emulating physics engines and preserving long-term consistency. (Players generally expect a scene they navigate away from to look the same if they return.) They also had to ensure GameGAN could model both the deterministic (predictable) and stochastic (random) behaviors within the games it tried to recreate.
A model in three parts
The team’s solution was a three-part model consisting of a dynamics engine, the aforementioned memory module, and a rendering engine. At a high level, GameGAN responds to the actions of an AI agent playing the generated game by producing frames of the environment in real time, even for layouts it has never seen before.
The dynamics engine is responsible for learning which actions aren’t “permissible” in the context of a game (like walking through a wall) and for modeling how objects respond as a consequence of actions. The memory module establishes long-term consistency so that simulated scenes (like buildings and streets) don’t change unexpectedly over time, in part by “remembering” every generated scene. (The memory module also retrieves static elements such as backgrounds as they’re needed.) The rendering engine, the last step in the pipeline, renders simulated images given object and attribute maps, accounting for depth automatically by occluding objects.
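The division of labor among the three parts can be illustrated with a toy grid world. In GameGAN all three components are learned neural networks; the sketch below hard-codes them purely to show how data might flow from action to frame, and every class and function name in it (`ToyDynamics`, `ToyMemory`, `render`, `generate_frame`) is a hypothetical stand-in rather than anything from the paper.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple

Pos = Tuple[int, int]  # (row, column)
MOVES: Dict[str, Pos] = {"UP": (-1, 0), "DOWN": (1, 0), "LEFT": (0, -1), "RIGHT": (0, 1)}


@dataclass
class ToyDynamics:
    """Stand-in for the dynamics engine: applies an action unless it is not permissible."""
    size: int
    walls: Set[Pos]

    def step(self, pos: Pos, action: str) -> Pos:
        d_row, d_col = MOVES[action]
        nxt = (pos[0] + d_row, pos[1] + d_col)
        in_bounds = 0 <= nxt[0] < self.size and 0 <= nxt[1] < self.size
        # Walking into a wall or off the grid leaves the agent where it was.
        return nxt if in_bounds and nxt not in self.walls else pos


@dataclass
class ToyMemory:
    """Stand-in for the memory module: keeps the first thing seen at each cell."""
    cells: Dict[Pos, str] = field(default_factory=dict)

    def write(self, pos: Pos, tile: str) -> None:
        self.cells.setdefault(pos, tile)  # never overwrite, so revisits look the same

    def read(self, pos: Pos) -> str:
        return self.cells.get(pos, "?")  # "?" marks cells not yet remembered


def render(memory: ToyMemory, agent: Pos, size: int) -> List[str]:
    """Stand-in for the rendering engine: static tiles first, agent drawn on top."""
    return [
        "".join("P" if (r, c) == agent else memory.read((r, c)) for c in range(size))
        for r in range(size)
    ]


def generate_frame(dyn: ToyDynamics, memory: ToyMemory, pos: Pos, action: str):
    """One pass through the pipeline: dynamics, then memory, then rendering."""
    new_pos = dyn.step(pos, action)
    memory.write(new_pos, ".")  # remember the visited cell as open floor
    return new_pos, render(memory, new_pos, dyn.size)


dynamics = ToyDynamics(size=3, walls={(0, 1)})
memory = ToyMemory()
memory.write((0, 0), ".")
pos, frame_blocked = generate_frame(dynamics, memory, (0, 0), "RIGHT")  # wall in the way
pos, frame_moved = generate_frame(dynamics, memory, pos, "DOWN")
print("\n".join(frame_moved))
```

The first action is rejected because a wall occupies the target cell, so the agent stays put. After the second action, the starting cell is still drawn as floor because the memory kept it, which is a crude version of the long-term consistency described above.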
GameGAN trains using a so-called adversarial approach, where the system attempts to “fool” three discriminators (a single-image discriminator, an action-conditioned discriminator, and a temporal discriminator) in order to produce realistic, coherent games. GameGAN synthesizes images from noise samples drawn from a random distribution and then feeds them, along with real examples from a training data set, to the discriminators, which attempt to distinguish between the two. Both GameGAN and the discriminators improve in their respective abilities until the discriminators are unable to tell the real examples from the synthesized examples with better than the 50% accuracy expected of chance.
Training occurs in an unsupervised fashion, meaning that GameGAN infers the patterns within data sets without reference to known, labeled, or annotated outcomes. Interestingly, the discriminators’ work informs that of GameGAN: every time a discriminator correctly identifies a synthesized example, its feedback tells GameGAN how to tweak its output so that it might be more realistic in the future.
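The tug-of-war between generator and discriminator can be written down as a pair of loss functions. The sketch below uses the textbook cross-entropy formulation of a GAN with a single discriminator; the article does not say which exact loss GameGAN uses, so treat this as an assumption chosen for illustration. The function names are hypothetical, and the inputs are the discriminator’s estimated probabilities that each example is real.

```python
import math
from typing import List


def discriminator_loss(p_real: List[float], p_fake: List[float]) -> float:
    """Cross-entropy loss: push scores on real examples toward 1, on fakes toward 0."""
    real_term = -sum(math.log(p) for p in p_real) / len(p_real)
    fake_term = -sum(math.log(1.0 - p) for p in p_fake) / len(p_fake)
    return real_term + fake_term


def generator_loss(p_fake: List[float]) -> float:
    """Non-saturating generator loss: lower when fakes are scored as real."""
    return -sum(math.log(p) for p in p_fake) / len(p_fake)


# A discriminator reduced to guessing assigns 0.5 to everything.
chance_level = discriminator_loss([0.5, 0.5], [0.5, 0.5])
print(round(chance_level, 3))

# The generator is better off when its output fools the discriminator.
print(generator_loss([0.9]) < generator_loss([0.1]))
```

When the discriminator can do no better than a coin flip, its loss settles at 2 ln 2 (about 1.386), which corresponds to the 50% accuracy mentioned above.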
In experiments, the Nvidia team fed GameGAN 50,000 episodes (several million frames in total) of Pac-Man and the Doom-based AI research platform VizDoom over the course of four days. (Bandai Namco’s research division provided a copy of Pac-Man for training.) They used a modified version of Pac-Man with an environment half the normal size (a 7-by-7 grid as opposed to a 14-by-14 grid) as well as a variation dubbed Pac-Man-Maze, which lacked ghosts and had walls randomly created by an algorithm.
Excepting the occasional failure case, GameGAN indeed delivered “temporally consistent” Pac-Man- and Doom-like experiences complete with ghosts and pellets (in the case of the Pac-Man imitation) and fireballs and rooms (VizDoom).
Perhaps more excitingly, because of its disentanglement step, the system allowed enemies within the simulated games to be moved around the map and backgrounds or foregrounds to be swapped with random images.
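One common way to realize this kind of swap, assumed here for illustration rather than taken from the article, is to blend a dynamic layer and a static layer with a per-pixel mask. In the sketch below the mask is 1 where a moving object is drawn and 0 where the background shows through; `composite` is a hypothetical helper, and the images are tiny nested lists.

```python
from typing import List

Image = List[List[float]]


def composite(mask: Image, dynamic: Image, static: Image) -> Image:
    """Blend per pixel: mask * dynamic + (1 - mask) * static."""
    return [
        [m * d + (1 - m) * s for m, d, s in zip(mask_row, dyn_row, stat_row)]
        for mask_row, dyn_row, stat_row in zip(mask, dynamic, static)
    ]


mask = [[1, 0], [0, 0]]        # a single moving object in the top-left pixel
dynamic = [[9, 9], [9, 9]]     # appearance of the dynamic layer
background_a = [[1, 2], [3, 4]]
background_b = [[5, 6], [7, 8]]

original = composite(mask, dynamic, background_a)
swapped = composite(mask, dynamic, background_b)
print(original, swapped)
```

Because the object and the background live in separate layers, replacing the background leaves the object pixel untouched.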
To measure the generated games’ quality more quantitatively, the researchers deployed reinforcement learning agents within both games and tasked them with achieving high scores. For instance, the Pac-Man agent had to “eat” pellets and capture a flag, and it was penalized each time a ghost consumed it or it reached the maximum number of steps. Across 100 test environments, the agents solved the VizDoom-like game (making them the first trained with a GAN framework to do so, the team claims) and beat several baselines in Pac-Man.
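The Pac-Man scoring scheme described above maps naturally onto a per-step reward function. The sketch below is only a guess at its shape: the article gives no reward magnitudes, so the constants are hypothetical values picked for illustration, as are the `StepEvents` and `reward` names.

```python
from dataclasses import dataclass


@dataclass
class StepEvents:
    """What happened on one step of the (toy) Pac-Man environment."""
    ate_pellet: bool = False
    captured_flag: bool = False
    caught_by_ghost: bool = False
    hit_step_limit: bool = False


# Hypothetical magnitudes; the article does not report the real ones.
PELLET_REWARD = 1.0
FLAG_REWARD = 10.0
GHOST_PENALTY = -10.0
STEP_LIMIT_PENALTY = -5.0


def reward(events: StepEvents) -> float:
    """Sum rewards and penalties for everything that happened this step."""
    total = 0.0
    if events.ate_pellet:
        total += PELLET_REWARD
    if events.captured_flag:
        total += FLAG_REWARD
    if events.caught_by_ghost:
        total += GHOST_PENALTY
    if events.hit_step_limit:
        total += STEP_LIMIT_PENALTY
    return total


print(reward(StepEvents(ate_pellet=True)))
print(reward(StepEvents(ate_pellet=True, caught_by_ghost=True)))
```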
The researchers believe GameGAN has obvious applicability to game design, where it could be used alongside tools like Promethean AI’s art-generating platform to quickly create new levels and environments. But they also envision future, similar systems that can learn to mimic the rules of driving, for instance, or the laws of physics just by watching videos and seeing agents take actions. In the nearer term, as alluded to earlier, GameGAN could write simulators to train warehouse robots that can grasp and move objects around or delivery robots that must traverse sidewalks to deliver food and medicine.
Nvidia says it’ll make the generated games from its experiments available on its AI Playground platform later this year.