MIT study finds humans struggle when partnered with RL agents

Artificial intelligence has proven that complicated board and video games are no longer the exclusive domain of the human mind. From chess to Go to StarCraft, AI systems that use reinforcement learning algorithms have outperformed human world champions in recent years.

But despite the high individual performance of RL agents, they can become frustrating teammates when paired with human players, according to a study by AI researchers at MIT Lincoln Laboratory. The study, which involved cooperation between humans and AI agents in the card game Hanabi, shows that players prefer the classic and predictable rule-based AI systems over complex RL systems.

The findings, presented in a paper published on arXiv, highlight some of the underexplored challenges of applying reinforcement learning to real-world situations and can have important implications for the future development of AI systems that are meant to cooperate with humans.

Finding the gap in reinforcement learning

Deep reinforcement learning, the technique behind state-of-the-art game-playing bots, starts by providing an agent with a set of possible actions in the game, a mechanism to receive feedback from the environment, and a goal to pursue. Then, through numerous episodes of gameplay, the RL agent gradually goes from taking random actions to learning sequences of actions that help it reach that goal.
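To make that loop concrete, here is a minimal sketch. It uses tabular Q-learning rather than a deep network, and a made-up five-cell corridor rather than a real game. The environment, the function names, and the hyperparameters are all assumptions chosen for illustration. None of it comes from the paper.

```python
import random

ACTIONS = (-1, +1)  # the agent's possible actions: step left or step right


def train_corridor_agent(episodes=500, length=5, epsilon=0.1,
                         alpha=0.5, gamma=0.9, seed=0):
    """Tabular Q-learning on a hypothetical corridor.

    The agent starts in cell 0 and the goal is the last cell.
    """
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in range(length) for a in ACTIONS}
    goal = length - 1
    for _ in range(episodes):
        state = 0
        while state != goal:
            # Sometimes explore at random; otherwise exploit what was learned.
            if rng.random() < epsilon:
                action = rng.choice(ACTIONS)
            else:
                best = max(q[(state, a)] for a in ACTIONS)
                action = rng.choice(
                    [a for a in ACTIONS if q[(state, a)] == best])
            next_state = min(max(state + action, 0), goal)
            # Feedback from the environment: reward only at the goal.
            reward = 1.0 if next_state == goal else 0.0
            future = 0.0 if next_state == goal else max(
                q[(next_state, a)] for a in ACTIONS)
            q[(state, action)] += alpha * (
                reward + gamma * future - q[(state, action)])
            state = next_state
    return q


def greedy_policy(q, length=5):
    """Best learned action for every non-goal cell."""
    return [max(ACTIONS, key=lambda a: q[(s, a)]) for s in range(length - 1)]


q_table = train_corridor_agent()
policy = greedy_policy(q_table)
```

Early episodes are close to a random walk because every value starts at zero. As rewards propagate back through the table, the greedy policy settles on stepping toward the goal from every cell.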

Early research on deep reinforcement learning relied on the agent being pre-trained on gameplay data from human players. More recently, researchers have been able to develop RL agents that can learn games from scratch through pure self-play without human input.

In their study, the researchers at MIT Lincoln Laboratory were interested in finding out if a reinforcement learning program that outperforms humans could become a reliable coworker to humans.

"At a very high level, this work was inspired by the question: What technology gaps exist that prevent reinforcement learning (RL) from being applied to real-world problems, not just video games?" Dr. Ross Allen, AI researcher at Lincoln Laboratory and co-author of the paper, told TechTalks. "While many such technology gaps exist (e.g., the real world is characterized by uncertainty/partial-observability, data scarcity, ambiguous/nuanced objectives, disparate timescales of decision making, etc.), we identified the need to collaborate with humans as a key technology gap for applying RL in the real world."

Adversarial vs. cooperative games

Recent research mostly applies reinforcement learning to single-player games (e.g., Atari Breakout) or adversarial games (e.g., StarCraft, Go), where the AI is pitted against a human player or another game-playing bot.

"We think that reinforcement learning is well suited to address problems on human-AI collaboration for similar reasons that RL has been successful in human-AI competition," Allen said. "In competitive domains RL was successful because it avoided the biases and assumptions on how a game should be played, instead learning all of this from scratch."

In fact, in some cases, reinforcement learning systems have managed to hack the games and find tricks that baffled even the most talented and experienced human players. One famous example was a move made by DeepMind's AlphaGo in its matchup against Go world champion Lee Sedol. Analysts first thought the move was a mistake because it went against the intuitions of human experts. But the same move ended up turning the tide in favor of the AI player and defeating Lee. Allen thinks the same kind of ingenuity can come into play when RL is teamed up with humans.

"We think RL can be leveraged to advance the state of the art of human-AI collaboration by avoiding the preconceived assumptions and biases that characterize rule-based expert systems," Allen said.

For their experiments, the researchers chose Hanabi, a card game in which two to five players must cooperate to play their cards in a specific order. Hanabi is especially interesting because while simple, it is also a game of full cooperation and limited information. Players must hold their cards backward and can't see their faces. Each player can, however, see the faces of their teammates' cards. Players can use a limited number of tokens to provide each other clues about the cards they're holding. Players must use the information they see in their teammates' hands and the limited hints they know about their own hand to develop a winning strategy.
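The limited-information setup can be illustrated with a short sketch. The `Card` class, the `observe` function, and the player names are hypothetical and only model who sees what. They are not taken from any Hanabi implementation used in the study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Card:
    color: str
    rank: int


def observe(hands, viewer):
    """Return what `viewer` sees: every hand except their own."""
    return {
        player: (["hidden"] * len(cards) if player == viewer else list(cards))
        for player, cards in hands.items()
    }


# Hypothetical two-player deal, shortened to two cards each.
hands = {
    "alice": [Card("red", 1), Card("blue", 3)],
    "bob": [Card("green", 2), Card("red", 4)],
}
alice_view = observe(hands, "alice")
bob_view = observe(hands, "bob")
```

Each player's view hides exactly the cards that player most needs to know about, which is why hints from teammates carry so much weight.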

"In the pursuit of real-world problems, we have to start simple," Allen said. "Thus we focus on the benchmark collaborative game of Hanabi."

In recent years, several research teams have explored the development of AI bots that can play Hanabi. Some of these agents use symbolic AI, where the engineers provide the rules of gameplay beforehand, while others use reinforcement learning.

The AI systems are rated based on their performance in self-play (where the agent plays with a copy of itself), cross-play (where the agent is teamed with other types of agents), and human-play (where the agent cooperates with a human).
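The difference between these evaluation modes can be sketched in a few lines. Everything here is a toy assumption: `toy_game` stands in for a full Hanabi match, the scores 20 and 8 are arbitrary, and the "convention" field is a stand-in for whatever private strategy an agent has learned.

```python
from itertools import permutations
from statistics import mean


def toy_game(agent_a, agent_b):
    """Hypothetical match: partners sharing a convention score higher."""
    return 20 if agent_a["convention"] == agent_b["convention"] else 8


def self_play_score(agent, play=toy_game):
    """Score when the agent is paired with a copy of itself."""
    return play(agent, agent)


def cross_play_score(agents, play=toy_game):
    """Mean score over every ordered pairing of two different agents."""
    return mean(play(a, b) for a, b in permutations(agents, 2))


agents = [
    {"name": "A", "convention": "x"},
    {"name": "B", "convention": "y"},
    {"name": "C", "convention": "x"},
]
self_scores = [self_play_score(agent) for agent in agents]
cross_score = cross_play_score(agents)
```

In this toy setup every agent does well with a copy of itself, while the cross-play average drops because some partners follow a different convention. Human-play would replace one of the two agents with a person.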

"Cross-play with humans, referred to as human-play, is of particular importance as it measures human-machine teaming and is the foundation for the experiments in our paper," the researchers write.

To test the efficiency of human-AI cooperation, the researchers used SmartBot, the top-performing rule-based AI system in self-play, and Other-Play, a Hanabi bot that ranked highest in cross-play and human-play among RL algorithms.

"This work directly extends previous work on RL for training Hanabi agents. In particular we study the 'Other Play' RL agent from Jakob Foerster's lab," Allen said. "This agent was trained in such a way that made it particularly well suited for collaborating with other agents it had not met during training. It had produced state-of-the-art performance in Hanabi when teamed with other AI it had not met during training."

Human-AI cooperation

In the experiments, human participants played several games of Hanabi with an AI teammate. The players were exposed to both SmartBot and Other-Play but weren't told which algorithm was working behind the scenes.

The researchers evaluated the level of human-AI cooperation based on objective and subjective metrics. Objective metrics include scores, error rates, etc. Subjective metrics include the experience of the human players, including the level of trust and comfort they feel in their AI teammate, and their ability to understand the AI's motives and predict its behavior.

There was no significant difference in the objective performance of the two AI agents. But the researchers expected the human players to have a more positive subjective experience with Other-Play, since it had been trained to cooperate with agents other than itself.

"Our results were surprising to us because of how strongly human participants reacted to teaming with the Other Play agent. In short, they hated it," Allen said.

According to the surveys from the participants, the more experienced Hanabi players had a poorer experience with the Other-Play RL algorithm in comparison to the rule-based SmartBot agent. One of the key points to success in Hanabi is the skill of providing subtle hints to other players. For example, say the "one of squares" card is laid on the table and your teammate holds the two of squares in his hand. By pointing at the card and saying "this is a two" or "this is a square," you're implicitly telling your teammate to play that card without giving him full information about the card. An experienced player would catch on to the hint immediately. But providing the same kind of information to the AI teammate proves to be much more difficult.
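The hint in that example can be written out as a small sketch. The function names and the dictionary layout are assumptions made for illustration, and the second card in the teammate's hand is invented so the hint has something to single out.

```python
def matching_positions(hand, attribute, value):
    """Positions in `hand` that a hint such as rank=2 would point at."""
    return [i for i, card in enumerate(hand) if card[attribute] == value]


def is_playable(card, table):
    """A card is playable when it is the next rank for its suit."""
    return card["rank"] == table.get(card["suit"], 0) + 1


# The one of squares is on the table; the teammate holds the two of squares.
table = {"squares": 1}
teammate_hand = [
    {"suit": "circles", "rank": 4},
    {"suit": "squares", "rank": 2},
]

hinted = matching_positions(teammate_hand, "rank", 2)
hinted_card_playable = is_playable(teammate_hand[hinted[0]], table)
```

The hint itself only reveals the rank. The teammate has to infer the rest: the two is worth pointing out because it can be played on the one already on the table. That unstated inference is the part the participants expected their AI partner to make.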

"I gave him information and he just throws it away," one participant said after being frustrated with the Other-Play agent, according to the paper. Another said, "At this point, I don't know what the point is."

Interestingly, Other-Play is designed to avoid the creation of "secretive" conventions that RL agents develop when they only go through self-play. This makes Other-Play an optimal teammate for AI algorithms that weren't part of its training regime. But it still has assumptions about the types of teammates it will encounter, the researchers note.

"Notably, [Other-Play] assumes that teammates are also optimized for zero-shot coordination. In contrast, human Hanabi players typically do not learn with this assumption. Pre-game convention-setting and post-game reviews are common practices for human Hanabi players, making human learning more akin to few-shot coordination," the researchers note in their paper.

Implications for future AI systems

"Our current findings give evidence that an AI's objective task performance alone (what we refer to as 'self-play' and 'cross-play' in the paper) may not correlate to human trust and preference when collaborating with that AI," Allen said. "This raises the question: what objective metrics do correlate to subjective human preferences? Given the huge amount of data needed to train RL-based agents, it's not really tenable to train with humans in the loop. Therefore, if we want to train AI agents that are accepted and valued by human collaborators, we likely need to find trainable objective functions that can act as surrogates to, or strongly correlate with, human preferences."

Meanwhile, Allen warns against extrapolating the results of the Hanabi experiment to other environments, games, or domains that they have not been able to test. The paper also acknowledges some of the limitations of the experiments, which the researchers are working to address in the future. For example, the subject pool was small (29 participants) and skewed toward people who were skilled in Hanabi, which implies that they had predefined behavioral expectations of the AI teammate and were more likely to have a negative experience with the eccentric behavior of the RL agent.

Nonetheless, the results can have important implications for the future of reinforcement learning research.

"If state-of-the-art RL agents can't even make an acceptable collaborator in a game as constrained and narrow scope as Hanabi; should we really expect that same RL techniques to 'just work' when applied to more complicated, nuanced, consequential games and real-world situations?" Allen said. "There is a lot of buzz about reinforcement learning within tech and academic fields; and rightfully so. However, I think our findings show that the remarkable performance of RL systems shouldn't be taken for granted in all possible applications."

For example, it might be easy to assume that RL could be used to train robotic agents capable of close collaboration with humans. But the results from the work done at MIT Lincoln Laboratory suggest the contrary, at least given the current state of the art, Allen says.

"Our results seem to imply that much more theoretical and applied work is needed before learning-based agents will be effective collaborators in complicated situations like human-robot interactions," he said.

Ben Dickson is a software engineer and the founder of TechTalks. He writes about technology, business, and politics.