Researchers find 'inconsistent' benchmarking across 3,867 AI research papers

The metrics used to benchmark AI and machine learning models often inadequately reflect those models' true performance. That's according to a preprint study from researchers at the Institute for Artificial Intelligence and Decision Support in Vienna, which analyzed model performance results from more than 3,000 papers on the open source web-based platform Papers with Code. They claim that alternative, more appropriate metrics are rarely used in benchmarking and that the reporting of metrics is inconsistent and unspecific, leading to ambiguities.

Benchmarking is an important driver of progress in AI research. A task (or tasks) and the metrics associated with it (or them) can be perceived as an abstraction of a problem the scientific community aims to solve. Benchmark data sets are conceptualized as fixed representative samples for tasks to be solved by a model. But while benchmarks covering a range of tasks including machine translation, object detection, and question-answering have been established, the coauthors of the paper claim that some metrics -- like accuracy (i.e., the ratio of correctly predicted samples to the total number of samples) -- emphasize certain aspects of performance at the expense of others.

In their analysis, the researchers looked at 32,209 benchmark results across 2,298 data sets from 3,867 papers published between 2000 and June 2020. They found the studies used a total of 187 distinct top-level metrics and that the most frequently used metric was "accuracy," which appeared in 38% of the benchmark data sets. The second and third most commonly reported metrics were "precision," or the fraction of relevant instances among retrieved instances, and "F-measure," or the weighted harmonic mean of precision and recall (the fraction of the total relevant instances actually retrieved). Beyond this, with respect to the subset of papers covering natural language processing, the three most commonly reported metrics were BLEU score (for things like summarization and text generation), the ROUGE metrics (video captioning and summarization), and METEOR (question-answering).
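To make those definitions concrete, here is a minimal Python sketch of precision, recall, and the F-measure computed from raw counts. The function names and the counts are hypothetical and chosen only for illustration; they do not come from the study.

```python
def precision(tp, fp):
    """Fraction of predicted-positive (retrieved) instances that are relevant."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    """Fraction of all relevant instances that were actually retrieved."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def f_measure(tp, fp, fn, beta=1.0):
    """Weighted harmonic mean of precision and recall.

    beta > 1 weights recall more heavily, beta < 1 weights precision more.
    beta = 1 gives the familiar F1 score.
    """
    p = precision(tp, fp)
    r = recall(tp, fn)
    if p == 0.0 and r == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * p * r / (b2 * p + r)


# Hypothetical counts: 8 true positives, 2 false positives, 8 false negatives.
p = precision(8, 2)
r = recall(8, 8)
f1 = f_measure(8, 2, 8)
print(f"precision={p:.3f} recall={r:.3f} F1={f1:.3f}")
```

With these counts, precision is 0.8 and recall is 0.5, and the F1 score lands between the two but closer to the lower value, which is the usual behavior of a harmonic mean.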

For more than three quarters (77.2%) of the analyzed benchmark data sets, only a single performance metric was reported, according to the researchers. A smaller fraction (14.4%) of the benchmark data sets had two top-level metrics, and 6% had three metrics.

The researchers note irregularities in the reporting of metrics they identified, like the referencing of "area under the curve" as simply "AUC." Area under the curve is a measure of accuracy that can be interpreted in different ways depending on whether it's drawn plotting precision and recall against each other (PR-AUC) or recall and the false-positive rate (ROC-AUC). Similarly, several papers referred to a natural language processing benchmark -- ROUGE -- without specifying which variant was used. ROUGE has precision- and recall-tailored subvariants, and while the recall subvariant is more common, this could lead to ambiguities when comparing results between papers, the researchers argue.
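The difference between the two readings of "AUC" can be shown with a short Python sketch. It is an illustration under stated assumptions, not the researchers' method: the labels and scores are made up, ROC-AUC is computed with the pairwise-ranking formulation, and the precision-recall curve is summarized with average precision, which is one common way (among several) to reduce that curve to a single number. The sketch also assumes all scores are distinct.

```python
def roc_auc(y_true, scores):
    """Probability that a random positive is scored above a random negative."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(y_true, scores):
    """Mean of the precision values at the rank of each positive instance."""
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    hits = 0
    total = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


# Hypothetical imbalanced example: 2 positives among 10 instances,
# ranked third and fourth by the model's score.
y_true = [0, 0, 1, 1, 0, 0, 0, 0, 0, 0]
scores = [0.95, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1]

roc = roc_auc(y_true, scores)
ap = average_precision(y_true, scores)
print(f"ROC-AUC={roc:.3f} PR-AUC (average precision)={ap:.3f}")
```

On this toy data the same model scores 0.75 under one definition and roughly 0.42 under the other, so a bare "AUC" in a results table leaves the reader guessing which is meant.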

Inconsistencies aside, many of the metrics used in the papers surveyed are problematic, the researchers say. Accuracy, which is often used to evaluate binary and multiclass classifier models, doesn't yield informative results when dealing with unbalanced corpora exhibiting large differences in the number of instances per class. If a classifier predicts the majority class in all cases, then accuracy is equal to the proportion of the majority class among the total cases. For example, if a given "class A" makes up 95% of all instances, a classifier that predicts "class A" all the time will have an accuracy of 95%.
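The 95% example can be reproduced in a few lines of Python. This is a minimal sketch with made-up labels; the second class name, "B," and the helper function are assumptions added for illustration.

```python
def accuracy(y_true, y_pred):
    """Ratio of correctly predicted samples to the total number of samples."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Hypothetical data set: 95 instances of class A, 5 of class B.
y_true = ["A"] * 95 + ["B"] * 5

# A "classifier" that ignores its input and always predicts the majority class.
y_pred = ["A"] * len(y_true)

acc = accuracy(y_true, y_pred)

# How many of the class B instances were found? None.
found_b = sum(t == "B" and p == "B" for t, p in zip(y_true, y_pred))

print(f"accuracy={acc:.2f}, class B instances found={found_b} of 5")
```

The accuracy figure looks strong even though the classifier never identifies a single instance of the minority class.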

Precision and recall also have limitations in that they focus only on the instances a classifier predicts as positive (precision) or on the instances that are actually positive (recall). Both ignore a model's capacity to accurately predict negative cases. As for F-scores, they sometimes give more weight to precision than to recall, providing misleading results for classifiers biased toward predicting the majority class. Besides this, they can focus on only one class at a time.
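A small Python sketch shows what ignoring negative cases means in practice. The counts are hypothetical, and specificity (the fraction of actual negatives correctly identified) is brought in here only as a point of comparison; it is not a metric the article attributes to the study.

```python
def summarize(tp, fp, fn, tn):
    """Compute precision, recall, and specificity from confusion-matrix counts."""
    return {
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
        # Specificity is the only one of the three that uses true negatives.
        "specificity": tn / (tn + fp),
    }


# Two hypothetical classifiers with identical positive-side counts
# but very different numbers of correctly rejected negatives.
few_negatives = summarize(tp=40, fp=10, fn=10, tn=5)
many_negatives = summarize(tp=40, fp=10, fn=10, tn=940)

for name, result in [("few TN", few_negatives), ("many TN", many_negatives)]:
    print(name, {k: round(v, 3) for k, v in result.items()})
```

Precision and recall are 0.8 in both cases, because the true-negative count never enters either formula. Only specificity reveals that the first classifier rejects a third of the negatives correctly while the second rejects almost all of them.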

In the natural language processing domain, the researchers highlight issues with metrics like BLEU and ROUGE. BLEU doesn't consider recall and doesn't correlate well with human judgments of machine translation quality, and ROUGE doesn't adequately cover tasks that rely on extensive paraphrasing, such as abstractive summarization and extractive summarization of transcripts with many different speakers, like meeting transcripts.
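The point about recall can be illustrated with a deliberately simplified Python sketch. This is not BLEU itself: it computes only clipped unigram precision, the building block BLEU generalizes to longer n-grams, alongside the corresponding unigram recall. The sentences are made up, and the function names are hypothetical.

```python
from collections import Counter


def clipped_overlap(candidate, reference):
    """Count candidate tokens that also occur in the reference.

    Each token is counted at most as often as it appears in the reference.
    """
    cand = Counter(candidate.split())
    ref = Counter(reference.split())
    return sum(min(count, ref[token]) for token, count in cand.items())


def unigram_precision(candidate, reference):
    """Share of candidate tokens that are supported by the reference."""
    return clipped_overlap(candidate, reference) / len(candidate.split())


def unigram_recall(candidate, reference):
    """Share of reference tokens that the candidate covers."""
    return clipped_overlap(candidate, reference) / len(reference.split())


reference = "the cat sat on the mat"
candidate = "the cat"  # Every word is correct, but most of the meaning is missing.

prec = unigram_precision(candidate, reference)
rec = unigram_recall(candidate, reference)
print(f"precision={prec:.2f} recall={rec:.2f}")
```

The truncated candidate gets a perfect precision of 1.0 while covering only a third of the reference. Full BLEU applies a brevity penalty that partly compensates for very short outputs, but it does not measure recall directly.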

The researchers found that better metric alternatives such as the Matthews correlation coefficient and the Fowlkes-Mallows index, which address some of the shortcomings in accuracy and F-score metrics, weren't used in any of the papers they analyzed. In fact, in 83.1% of the benchmark data sets where the top-level metric "accuracy" was reported, there weren't any other top-level metrics, and F-measure was the only metric in 60.9% of the data sets. This was also true of the natural language processing metrics. METEOR, which has been shown to strongly correlate with human judgment across tasks, was used only 13 times. And GLEU, which aims to assess how well generated text conforms to "normal" language usage, appeared only three times.
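For readers unfamiliar with the two alternatives, the following Python sketch computes the Matthews correlation coefficient (MCC) and the Fowlkes-Mallows index from confusion-matrix counts, using their standard formulas. The counts are hypothetical, the minority class is treated as the positive class, and returning 0.0 when a denominator is zero is an assumed convention rather than part of either definition.

```python
import math


def mcc(tp, fp, fn, tn):
    """Matthews correlation coefficient; uses all four confusion-matrix cells."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Assumed convention: report 0.0 when the denominator is zero.
    return (tp * tn - fp * fn) / denom if denom else 0.0


def fowlkes_mallows(tp, fp, fn):
    """Geometric mean of precision and recall."""
    denom = math.sqrt((tp + fp) * (tp + fn))
    # Assumed convention: report 0.0 when the denominator is zero.
    return tp / denom if denom else 0.0


def accuracy(tp, fp, fn, tn):
    return (tp + tn) / (tp + fp + fn + tn)


# Hypothetical data: 95 majority instances, 5 minority (positive) instances.
# Classifier 1 always predicts the majority class.
always_majority = dict(tp=0, fp=0, fn=5, tn=95)
# Classifier 2 finds 4 of the 5 minority instances at the cost of 5 false alarms.
finds_minority = dict(tp=4, fp=5, fn=1, tn=90)

for name, c in [("always majority", always_majority), ("finds minority", finds_minority)]:
    fm = fowlkes_mallows(c["tp"], c["fp"], c["fn"])
    print(f"{name}: accuracy={accuracy(**c):.2f} MCC={mcc(**c):.2f} FM={fm:.2f}")
```

On these counts, accuracy slightly prefers the classifier that never finds the minority class (0.95 versus 0.94), while MCC and the Fowlkes-Mallows index both score it at 0 and rate the second classifier at roughly 0.57 and 0.60.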

The researchers concede their decision to analyze preprints as opposed to papers accepted to scientific journals could skew the results of their study. However, they stand behind their conclusion that the majority of metrics currently used to evaluate AI benchmark tasks have properties potentially resulting in an inadequate reflection of a classifier's performance, especially when used with imbalanced data sets. "While alternative metrics that address problematic properties have been proposed, they are currently rarely applied as performance metrics in benchmarking tasks, where a small set of historically established metrics is used instead. NLP-specific tasks pose additional challenges for metrics design due to language and task-specific complexities," the researchers wrote.

A growing number of academics are calling for a focus on scientific advancement in AI rather than better performance on benchmarks. In a June interview, Denny Britz, a former resident on the Google Brain team, said he believed that chasing state-of-the-art results is bad practice because there are too many confounding variables and because it favors large, well-funded labs like DeepMind and OpenAI. Separately, Zachary Lipton (an assistant professor at Carnegie Mellon University) and Jacob Steinhardt (a member of the statistics faculty at the University of California, Berkeley) proposed in a recent meta-analysis that AI researchers home in on the how and why of an approach as opposed to performance and conduct more error analysis, ablation studies, and robustness checks in the course of research.
