DeepMind wants to teach robots to play board games

Mastering physical systems with abstract goals is an unsolved challenge in AI. To encourage the development of techniques that might overcome it, researchers at DeepMind created custom scenarios for the physics engine MuJoCo that task an AI agent with coordinating perception, reasoning, and motor control over time. They believe that the library, which they've made publicly available, can help bridge the gap between abstract planning and embodied control.

Recent work in machine learning has led to algorithms capable of mastering board games such as Go, chess, and shogi. These algorithms observe the state of a game and change it directly with their actions. Humans, by contrast, don't just reason about moves: they look at the board and physically manipulate the pieces with their fingers. Beyond games, many real-world problems require a combination of perception, planning, and execution, which even leading algorithms mostly fail to capture.

The team's solution is a set of challenges that embed tasks from games (e.g., tic-tac-toe, Sokoban) into environments where agents must control a physical body to execute moves. For example, to place a single tic-tac-toe piece, an agent has to reach the board with a 9-degree-of-freedom arm and touch the corresponding place on that board. Learning to play tic-tac-toe and executing a reaching movement are well within the capabilities of current AI approaches, but most agents struggle when they're faced with both problems at once.

In MuJoBan, which is based on Sokoban, an agent situated on a grid has to push boxes onto target locations. Only one box can be pushed at a time and boxes can only be pushed, not pulled. MuJoXo is akin to tic-tac-toe, with randomness to ensure pieces aren't aligned perfectly on the board. The last game, MuJoGo, is a 7-by-7 Go board designed to be solved in roughly 50 moves (2.5 seconds).
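The Sokoban rules described above (push only, one box at a time) can be sketched in a few lines of Python. This is a minimal illustration of the abstract game logic only, not code from DeepMind's library; the function names `step` and `solved` and the grid representation are assumptions made for the example, and the physical control problem the environments add is left out entirely.

```python
# Illustrative sketch of abstract Sokoban push rules.
# Hypothetical names and data layout; not the DeepMind library's API.

def step(walls, boxes, agent, move):
    """Try to move the agent one cell and return (new_agent, new_boxes).

    walls: set of (row, col) cells that are blocked
    boxes: frozenset of (row, col) cells holding a box
    agent: (row, col) position of the agent
    move:  (d_row, d_col) unit step, e.g. (0, 1) for "right"
    """
    d_row, d_col = move
    target = (agent[0] + d_row, agent[1] + d_col)
    if target in walls:
        return agent, boxes  # blocked by a wall, nothing changes
    if target in boxes:
        beyond = (target[0] + d_row, target[1] + d_col)
        # A box moves only if the cell behind it is free, so two boxes
        # in a row cannot be pushed together.
        if beyond in walls or beyond in boxes:
            return agent, boxes
        boxes = (boxes - {target}) | {beyond}
    # Boxes move only when the agent steps into them, so pulling is impossible.
    return target, boxes


def solved(boxes, targets):
    """A level is solved when every box sits on a target cell."""
    return set(boxes) == set(targets)


if __name__ == "__main__":
    agent, boxes = step(set(), frozenset({(0, 1)}), (0, 0), (0, 1))
    print(agent, sorted(boxes), solved(boxes, {(0, 2)}))
```

Even in this stripped-down form, a wrong push can leave a box stuck against a wall, which is part of why Sokoban demands multistep planning.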

In experiments, the researchers designed example agents to complete various game tasks. The agents employed a planner module that mapped ground-truth game states to target states and plotted out the actions needed to reach them. The researchers also used an auxiliary task to encourage agents to follow instructions: an agent received a reward when its actions resulted in the game moves suggested by the instructions. (A "reward" is positive feedback that reinforces desirable behaviors, in this case game moves.)
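As a rough illustration of the instruction-following reward described above, the sketch below pays a bonus only when the abstract game state changes to the state the instruction asked for. The function name `instruction_reward`, the `bonus` parameter, and the tuple-based tic-tac-toe board are all assumptions made for this example; the article does not specify how the researchers implemented the reward.

```python
# Hypothetical sketch of an instruction-following auxiliary reward.
# The name, signature, and board encoding are assumptions, not the paper's code.

def instruction_reward(state_before, state_after, instructed_state, bonus=1.0):
    """Return `bonus` if the game state changed into the instructed state, else 0.0."""
    changed = state_after != state_before
    if changed and state_after == instructed_state:
        return bonus
    return 0.0


if __name__ == "__main__":
    empty = (" ",) * 9                          # 3x3 tic-tac-toe board, row-major
    instructed = empty[:4] + ("X",) + empty[5:]  # instruction: place X in the center
    wrong = ("X",) + empty[1:]                   # agent placed X in a corner instead

    print(instruction_reward(empty, instructed, instructed))
    print(instruction_reward(empty, wrong, instructed))
```

Checking the resulting game state rather than the motor commands leaves the agent free to find its own movements, as long as they produce the suggested move.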

The researchers report that the agents were unable to solve more than half of the levels in MuJoBan after extensive training, which they blame on a combination of multistep reasoning and control challenges. The simplest agent required around a million games before it could play MuJoXo "convincingly," and it didn't show any sign of progress in MuJoGo even after billions of steps of training.

"Problems that require reasoning and decision making over long time scales using sensorimotor control cannot yet be solved in an end-to-end fashion. These problems arise frequently in human behavior but are still hard to frame and rarely studied in a controlled experimental setting," the researchers wrote in a paper describing the work. "We hope that the environments provided here will spur research into how to coherently introduce these capacities into the next generation of AI agents."

All three scenarios are available on GitHub.
