AI Weekly: AI research still has a reproducibility problem

Many systems, such as autonomous vehicle fleets and drone swarms, can be modeled as Multi-Agent Reinforcement Learning (MARL) tasks, which deal with how multiple machines can learn to collaborate, coordinate, and compete. It's been shown that machine learning algorithms -- particularly reinforcement learning algorithms -- are well-suited to MARL tasks. But it's often challenging to efficiently scale them up to hundreds or even thousands of machines.

One solution is a technique called centralized training and decentralized execution (CTDE), which allows an algorithm to train using data from multiple machines but make predictions for each machine individually (e.g., when a driverless car should turn left). QMIX is a popular algorithm that implements CTDE, and many research groups claim to have designed QMIX algorithms that perform well on difficult benchmarks. But a new paper claims that these algorithms' improvements might only be the result of code optimizations or "tricks" rather than design innovations.

In reinforcement learning, algorithms are trained to make a sequence of decisions. AI-guided machines learn to achieve a goal through trial and error, receiving either rewards or penalties for the actions they perform. But "tricks" like learning rate annealing, which has an algorithm first train quickly before slowing down the process, can yield misleadingly competitive performance results on benchmark tests.
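To make the idea of learning rate annealing concrete, here is a minimal Python sketch of one common form, a linear schedule. The function name `annealed_learning_rate` and the start and end values are illustrative assumptions, not settings taken from the paper or from any QMIX implementation.

```python
def annealed_learning_rate(step, total_steps, lr_start=1e-3, lr_end=1e-5):
    """Linearly anneal the learning rate from lr_start down to lr_end.

    All names and default values are hypothetical, chosen only for illustration.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    # Clamp progress to [0, 1] so steps past the schedule stay at lr_end.
    progress = min(max(step / total_steps, 0.0), 1.0)
    return lr_start + (lr_end - lr_start) * progress


if __name__ == "__main__":
    # Large updates early in training, smaller ones later.
    for step in (0, 2500, 5000, 10000):
        print(step, annealed_learning_rate(step, total_steps=10000))
```

A schedule like this is a few lines of code, which is part of why it can go unreported: two implementations of the same algorithm can differ only in whether such a schedule is switched on.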

In experiments, the coauthors tested proposed variations of QMIX on the StarCraft Multi-Agent Challenge (SMAC), which focuses on micromanagement challenges in Activision Blizzard's real-time strategy game StarCraft II. They found that QMIX algorithms from teams at the University of Virginia, the University of Oxford, and Tsinghua University managed to solve all of SMAC's scenarios when using a list of common tricks, but that when the QMIX variants were normalized, their performance was significantly worse.

One QMIX variant, LICA, was trained on substantially more data than QMIX, but in their research, the creators compared its performance to a "vanilla" QMIX model without code-level optimizations. The researchers behind another variant, PLEX, compared their test results from version 2.4.10 of SMAC against QMIX results from version 2.4.6, which is known to be more difficult than 2.4.10.

"[S]ome of the things mentioned are endemic among machine learning, like cherrypicking results or having inconsistent comparisons to other systems. It's not 'cheating' exactly (or at least, sometimes it's not) as much as it is just lazy science that should be picked up by someone reviewing. Unfortunately, peer review is a pretty lax process," Mike Cook, an AI researcher at Queen Mary University of London, told VentureBeat via email.

In a Reddit thread discussing the study, one user argues that the results point to the need for ablation studies, which remove components of an AI system one by one to measure how much each contributes to performance. The problem is that large-scale ablations can be expensive in the reinforcement learning domain, the user points out, because they require a lot of compute power.
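The shape of an ablation study can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the `ablation_study` interface, the component names, and the toy scoring function, whose weights are made-up placeholders rather than measurements from any experiment.

```python
def ablation_study(components, evaluate):
    """Score the full system, then re-score it with each component removed.

    `components` is a list of component names and `evaluate` is a
    caller-supplied function mapping a set of enabled components to a score.
    Both are hypothetical interfaces for this sketch.
    """
    full = frozenset(components)
    baseline = evaluate(full)
    # The drop in score when a component is removed estimates its contribution.
    drops = {}
    for name in components:
        drops[name] = baseline - evaluate(full - {name})
    return baseline, drops


def toy_evaluate(enabled):
    # Stand-in for a full training run; the weights are placeholders,
    # not results from any real benchmark.
    weights = {"lr_annealing": 3, "reward_scaling": 2, "new_architecture": 1}
    return sum(weights[name] for name in enabled)


if __name__ == "__main__":
    baseline, drops = ablation_study(
        ["lr_annealing", "reward_scaling", "new_architecture"], toy_evaluate
    )
    print("full system score:", baseline)
    for name, drop in drops.items():
        print(f"without {name}: score drops by {drop}")
```

In practice each call to `evaluate` stands for a complete training run, so the cost grows with every component tested, which is the expense the Reddit user describes.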

More broadly, the findings underline the reproducibility problem in AI research. Studies often provide benchmark results in lieu of source code, which becomes problematic when the thoroughness of the benchmarks is in question. One recent report found that 60% to 70% of answers given by natural language processing models were embedded somewhere in the benchmark training sets, indicating that the models were often simply memorizing answers. Another study -- a meta-analysis of over 3,000 AI papers -- found that metrics used to benchmark AI and machine learning models tended to be inconsistent, irregularly tracked, and not particularly informative.
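A rough way to check for this kind of leakage is to measure how many test answers already appear in the training data. The Python sketch below is a simplification under stated assumptions: it uses exact matching after light normalization, the function names are hypothetical, and the sample strings are invented for the demonstration. It is not the methodology of the report mentioned above.

```python
def normalize(text):
    """Lowercase and collapse whitespace so trivially different strings match."""
    return " ".join(text.lower().split())


def answer_overlap(train_answers, test_answers):
    """Return the fraction of test answers that also appear in the training answers."""
    if not test_answers:
        return 0.0
    seen = {normalize(answer) for answer in train_answers}
    hits = sum(1 for answer in test_answers if normalize(answer) in seen)
    return hits / len(test_answers)


if __name__ == "__main__":
    # Invented toy data, used only to exercise the function.
    train = ["Paris", "the  Moon", "1969"]
    test = ["paris", "Mars", "1969", "Jupiter"]
    print(f"overlap: {answer_overlap(train, test):.0%}")
```

A high overlap does not prove a model is memorizing, but it does mean a benchmark score cannot distinguish memorization from generalization.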

"In some ways the general state of reproduction, validation, and review in computer science is pretty appalling. And I guess that broader issue is quite serious given how this field is now impacting people's lives quite significantly," Cook continued.

Reproducibility challenges

In a 2018 blog post, Google engineer Pete Warden spoke to some of the core reproducibility issues that data scientists face. He referenced the iterative nature of current approaches to machine learning and the fact that researchers aren't easily able to record their steps through each iteration. Slight changes in factors like training or validation datasets can affect performance, he pointed out, making the root cause of differences between expected and observed results difficult to suss out.

"If [researchers] can’t get the same accuracy that the original authors did, how can they tell if their new approach is an improvement? It’s also clearly concerning to rely on models in production systems if you don’t have a way of rebuilding them to cope with changed requirements or platforms," Warden wrote. "It’s also stifling for research experimentation; since making changes to code or training data can be hard to roll back it’s a lot more risky to try different variations, just like coding without source control raises the cost of experimenting with changes."

Data scientists like Warden say that AI research should be presented in such a way that third parties can step in, train the novel models, and get the same results within a margin of error. In a recent letter published in the journal Nature -- a response to an algorithm detailed by Google in 2020 -- the coauthors lay out a number of expectations for reproducibility, including descriptions of model development, data processing, and training pipelines; open-sourced code and training datasets, or at least model predictions and labels; and a disclosure of the variables used to augment the training dataset, if any. A failure to include these "undermines [the] scientific value" of the research, they say.
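One small, concrete piece of this is fixing random seeds and reporting the spread across several runs rather than a single best number. The Python sketch below shows the pattern under loud assumptions: `run_experiment` is a hypothetical stand-in for a real training run, and the scores it returns are synthetic values, not results.

```python
import random
import statistics


def run_experiment(seed):
    """Stand-in for a training run: returns a synthetic score fixed by the seed."""
    rng = random.Random(seed)  # local generator, so runs do not share state
    return 0.8 + rng.uniform(-0.05, 0.05)


def summarize(seeds):
    """Run once per seed and report the mean and sample standard deviation."""
    if len(seeds) < 2:
        raise ValueError("need at least two seeds to estimate a spread")
    scores = [run_experiment(seed) for seed in seeds]
    return statistics.mean(scores), statistics.stdev(scores)


if __name__ == "__main__":
    mean, spread = summarize([0, 1, 2, 3, 4])
    print(f"score: {mean:.3f} +/- {spread:.3f} over 5 seeds")
```

Publishing the seed list alongside the mean and spread gives a third party a specific target to reproduce and a sense of how much run-to-run variation to expect.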

"Researchers are more incentivized to publish their finding rather than spend time and resources ensuring their study can be replicated … Scientific progress depends on the ability of researchers to scrutinize the results of a study and reproduce the main finding to learn from," reads the letter. "Ensuring that [new] methods meet their potential ... requires that [the] studies be reproducible."

Thanks for reading,

Kyle Wiggers

AI Staff Writer