Last August, Microsoft Research Asia detailed an AI system dubbed Super Phoenix (Suphx for short) that could defeat Mahjong players after learning from only 5,000 matches. A revised preprint paper out this week delves a bit deeper, revealing that Suphx — whose performance improved with additional training — is now rated above 99.99% of all ranked human players on Tenhou, a Japan-based global online Mahjong competition platform with over 350,000 members.
Building superhuman programs for games is a longstanding goal of the AI research community — and not without good reason. Games are an analog of the real world, with a measurable objective, and they can be played an unlimited number of times across hundreds (or thousands) of powerful machines. Moreover, researchers assert that the lessons learned are applicable to other domains, like the enterprise, where mundane but cognitively demanding tasks impact workers’ productivity.
“Most real-world problems such as finance market predication and logistic optimization share the same characteristics with Mahjong — i.e., complex operation/reward rules, imperfect information,” wrote the paper’s coauthors. “We believe our techniques designed in Suphx for Mahjong, including global reward prediction, oracle guiding, and … policy adaptation have great potential to benefit for a wide range of real-world applications.”
The paper’s coauthors note that Mahjong is an imperfect information game with complicated scoring rules. The loss of one round doesn’t mean a player played poorly; they might tactically lose to ensure they secure the top rank. Plus, Mahjong has a huge number of possible winning hands, and different winning hands result in different winning scores for each round. Taking into account the up to 13 game tiles in each person’s hand, the 14 tiles in the “dead” wall that stay hidden throughout the game, and the 70 tiles in the “live” wall that become visible only as tiles are drawn and discarded, on average there are more than 10⁴⁸ hidden states, indistinguishable to players, at any one time.
For these reasons, it’s hard for a Mahjong player — let alone a machine learning model — to decide which moves to make based on private tiles alone. Cognizant of this, the team built Suphx to tackle 4-player Japanese Mahjong (Riichi Mahjong), which has one of the largest Mahjong communities in the world.
Suphx comprises a family of convolutional neural networks, a type of AI model commonly applied to computer vision, and it learns five models to handle different scenarios: the discard, Riichi, Chow, Pong, and Kong models. Based on these, Suphx employs another rule-based model to decide whether to declare a winning hand and take the round, checking whether a winning hand can be formed from a tile discarded by other players or drawn from the wall.
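A rough sketch of how this decision flow might be structured is below. The function names, state layout, and the dummy models are all illustrative assumptions, not the paper's actual implementation; the point is the routing: a rule-based winning-hand check first, then dispatch to one of the five learned models.

```python
# Hypothetical sketch of Suphx's decision flow: five learned models for
# discard/Riichi/Chow/Pong/Kong, plus a rule-based check for declaring a win.
# All names and signatures here are illustrative, not from the paper.

def rule_based_win_check(hand, incoming_tile):
    # Placeholder: would return True if hand + incoming_tile forms a legal
    # winning hand (from another player's discard or drawn from the wall).
    return False

def choose_action(state, models):
    """Route a game state to the appropriate model.

    `models` maps decision types ("discard", "riichi", "chow", "pong", "kong")
    to callables that return an (action, argument) pair.
    """
    if rule_based_win_check(state["hand"], state.get("incoming_tile")):
        return ("win", None)
    model = models[state["decision_type"]]
    return model(state)

# Usage: a dummy discard model that always discards the first tile in hand.
models = {"discard": lambda s: ("discard", s["hand"][0])}
state = {"hand": ["1m", "9p", "5s"], "decision_type": "discard"}
print(choose_action(state, models))  # ('discard', '1m')
```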
The researchers had to design a set of features to encode game information into channels that could be “digested” by the models, including one for each of the 34 tiles in Japanese Mahjong and four for private player tiles. They also hand-crafted over 100 look-ahead features to indicate the probability and round score of a winning hand if a specific tile was discarded and then a tile from the wall was drawn.
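One plausible way to encode a private hand into channels of this shape is sketched below. The exact scheme is an assumption for illustration: each channel is a 34-slot vector (one slot per tile type), and a hand with up to four copies of a tile becomes four binary channels, where channel k marks tiles the player holds more than k copies of.

```python
# Illustrative tile-channel encoding (not the paper's exact scheme).
TILE_TYPES = 34  # 9 man + 9 pin + 9 sou + 7 honor tiles in Japanese Mahjong

def encode_hand(tile_counts):
    """Encode a private hand as four binary channels of length 34.

    tile_counts: dict mapping tile index (0..33) -> copies held (0..4).
    Channel k is set for a tile iff the player holds at least k+1 copies.
    """
    channels = [[0.0] * TILE_TYPES for _ in range(4)]
    for tile, count in tile_counts.items():
        for k in range(count):
            channels[k][tile] = 1.0
    return channels

# Usage: two copies of tile 0, one of tile 8, four of tile 33.
enc = encode_hand({0: 2, 8: 1, 33: 4})
print(len(enc), len(enc[0]))                 # 4 34
print(int(sum(sum(row) for row in enc)))     # 7 (total tiles encoded)
```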
Suphx had a three-step training process. First, all five of its models were trained using the logs of top human players collected from Tenhou’s platform. Then, they were fine-tuned via self-play reinforcement learning, using self-play workers containing a set of CPU-based Mahjong simulators and trajectory-generating GPU-based inference engines. Finally, during online play, run-time policy adaptation leveraged observations on the current round to make the system perform even better.
In the reinforcement learning step, every Mahjong simulator randomly initialized a game with Suphx as a player and three other AI opponents. When any of the four players needed to take an action, the simulator sent the current state to the GPU inference engine, which then returned an action to the simulator. Meanwhile, the inference engines pulled the up-to-date policy to ensure that the self-play policy didn’t diverge from the latest policy.
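The loop described above might be sketched as follows. Every class here is a stand-in (the paper does not give this interface): CPU-side simulators request actions from an inference engine, which periodically pulls the latest policy from a parameter server so self-play stays close to on-policy.

```python
import random

# Illustrative self-play loop: simulators query an inference engine for
# actions; the engine periodically syncs its policy with a parameter server.
# All components are stand-ins, not the paper's actual architecture.

class ParameterServer:
    def __init__(self):
        self.version = 0  # would hold the latest trained policy weights

    def latest_policy(self):
        return self.version

class InferenceEngine:
    def __init__(self, server, sync_every=10):
        self.server = server
        self.sync_every = sync_every
        self.policy = server.latest_policy()
        self.calls = 0

    def act(self, state):
        self.calls += 1
        if self.calls % self.sync_every == 0:
            # Pull the up-to-date policy so self-play doesn't diverge from it.
            self.policy = self.server.latest_policy()
        return random.choice(state["legal_actions"])  # dummy policy

def play_round(engine, steps=5):
    """Simulate one round, collecting a (state, action) trajectory."""
    trajectory = []
    for _ in range(steps):
        state = {"legal_actions": ["discard", "pong", "kong"]}
        trajectory.append((state, engine.act(state)))
    return trajectory

engine = InferenceEngine(ParameterServer())
traj = play_round(engine)
print(len(traj))  # 5
```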
A global reward predictor trained on player log data provided a reward signal by predicting the final game reward, given information about the current round and all previous rounds of the game. It was complemented by an “oracle” agent that sped up training during self-play by training on all perfect information about a state (including players’ private tiles and the tiles in the wall) and gradually discarding those features until it became a “normal” agent.
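The oracle-to-normal transition can be pictured as a feature dropout schedule. The split between “public” and “oracle” features and the masking mechanism below are assumptions for illustration: perfect-information features are zeroed out with increasing probability until the agent relies only on what a real player can observe.

```python
import random

# Illustrative sketch of oracle guiding: training starts with
# perfect-information ("oracle") features and drops them out with rising
# probability until only publicly observable features remain.
# Feature names and the dropout mechanism are assumptions, not the paper's.

def mask_oracle_features(features, dropout_p, rng=random):
    """Zero out each oracle feature independently with probability dropout_p.

    features: {"public": {...}, "oracle": {...}} mapping names to values.
    """
    masked = dict(features["public"])
    for name, value in features["oracle"].items():
        masked[name] = 0.0 if rng.random() < dropout_p else value
    return masked

features = {
    "public": {"own_hand": 1.0, "discards": 1.0},
    "oracle": {"opponent_tiles": 1.0, "wall_tiles": 1.0},
}
# Early in training: dropout_p near 0, oracle features mostly visible.
# By the end: dropout_p = 1.0, the agent is a "normal" (imperfect-info) agent.
late = mask_oracle_features(features, dropout_p=1.0)
print(late["opponent_tiles"], late["wall_tiles"])  # 0.0 0.0
```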
Suphx continually improves courtesy of an offline-trained policy, which randomly samples private tiles for the three opponents and wall tiles from the pool of tiles (excluding the system’s own tiles) and then generates trajectories. Policy adaptation is performed for each round independently, and it restarts for each subsequent round.
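The sampling step of this adaptation might look like the sketch below. The tile labels and counts follow the standard 136-tile Japanese Mahjong set; everything else (function name, deal sizes) is an illustrative assumption: given the system's own 13 tiles, the remaining tiles are shuffled and dealt as the three opponents' hidden hands plus the wall.

```python
import random

# Sketch of the sampling step in run-time policy adaptation: deal the tiles
# the system cannot see (opponents' hands and the wall) at random from the
# remaining pool. Tile counts match Japanese Mahjong (136 tiles total);
# the function itself is an illustrative assumption.

def sample_hidden_state(own_tiles, rng=random):
    # Build the full 136-tile set: 4 copies each of 1-9 in three suits
    # (m/p/s) plus 4 copies each of the 7 honor tiles (z).
    tiles = [t for suit in "mps" for n in range(1, 10)
             for t in [f"{n}{suit}"] * 4]
    tiles += [f"{h}z" for h in range(1, 8)] * 4

    pool = tiles.copy()
    for t in own_tiles:
        pool.remove(t)  # exclude the system's own tiles from the pool

    rng.shuffle(pool)
    opponents = [pool[i * 13:(i + 1) * 13] for i in range(3)]  # 3 hidden hands
    wall = pool[39:]                                           # remaining tiles
    return opponents, wall

# Usage with a hypothetical 13-tile hand.
own = ["1m"] * 4 + ["2m"] * 4 + ["3m"] * 4 + ["4m"]
opps, wall = sample_hidden_state(own)
print(len(opps), len(opps[0]), len(wall))  # 3 13 84
```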
The team evaluated Suphx on 20 Nvidia Tesla K80 GPUs, drawing 1,000 samples of 800,000 games each from a data set of over a million games. Prior to the experiments, they trained each model using 1.5 million games on 44 GPUs (4 Nvidia Titan XPs for the parameter server and 40 K80s for the self-play workers) over the course of two days.
After playing over 5,760 games against human players on Tenhou, Suphx achieved 10 dan in terms of record — something roughly only 180 players have ever done — and 8.74 dan in terms of stable rank (versus top human players’ 7.4). Anecdotally, the researchers report that Suphx is “very strong” at defense and has a very low deal-in rate (10.06%), and that it developed its own playing styles that keep tiles safe and win with half-flushes.
“Looking forward, we will introduce more novel technologies to Suphx, and continue to push the frontier of Mahjong AI and imperfect-information game playing,” said the paper’s coauthors.