Facebook's Dynabench now scores NLP models for metrics like 'fairness'

Last September, Facebook introduced Dynabench, a platform for AI data collection and benchmarking that uses humans and models "in the loop" to create challenging test datasets. Leveraging a technique called dynamic adversarial data collection, Dynabench measures how easily humans can fool AI, which Facebook believes is a better indicator of a model's quality than any provided by current benchmarks.

Today, Facebook updated Dynabench with Dynaboard, an evaluation-as-a-service platform for conducting evaluations of natural language processing models on demand. The company claims Dynaboard makes it possible to perform apples-to-apples comparisons of models without problems from bugs in test code, inconsistencies in filtering test data, and other reproducibility issues.

"Importantly, there is no single correct way to rank models in AI research," Facebook wrote in a blog post. "Since launching Dynabench, we've collected over 400,000 examples, and we've released two new, challenging datasets. Now we have adversarial benchmarks for all four of our initial official tasks within Dynabench, which initially focus on language understanding ... Although other platforms have addressed subsets of current issues, like reproducibility, accessibility, and compatibility, [Dynabench] addresses all of these issues in one single end-to-end solution."

Dynabench

A number of studies suggest that commonly used benchmarks do a poor job of estimating real-world AI performance. One recent report found that 60-70% of answers given by natural language processing (NLP) models were embedded somewhere in the benchmark training sets, indicating that the models were often simply memorizing answers. Another study -- a meta-analysis of over 3,000 AI papers -- found that metrics used to benchmark AI and machine learning models tended to be inconsistent, irregularly tracked, and not particularly informative.

Facebook's solution to this is what it calls the Dynascore, a metric designed to capture model performance on the axes of accuracy, compute, memory, robustness, and "fairness." The Dynascore allows AI researchers to tailor an evaluation by placing greater or less emphasis (or weight) on a collection of tests.

As users employ the Dynascore to gauge the performance of models, Dynabench tracks which examples fool the models and lead to incorrect predictions across the core tasks of natural language inference, question answering, sentiment analysis, and hate speech detection. These examples improve the systems and become part of more challenging datasets that train the next generation of models, which can in turn be benchmarked with Dynabench to create a "virtuous cycle" of research progress.

Crowdsourced annotators connect to Dynabench and receive feedback on a model's response. This enables them to employ tactics like making the model focus on the wrong word or attempt to answer questions requiring real-world knowledge. All examples on Dynabench are validated by other annotators, and if the annotators don't agree with the original label, the example is discarded from the test set.

Metrics

In the new Dynascore on Dynabench's Dynaboard, "accuracy" refers to the percentage of examples the model gets right. The exact accuracy metric is task-dependent -- while a task can have multiple accuracy metrics, only one, chosen by the task owners, can be used as part of the ranking function.

Compute, another component of the Dynascore, measures the computational efficiency of an NLP model. To account for computation, Dynascore measures the number of examples the model can process per second on its instance in Facebook's evaluation cloud.
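As a rough illustration of the examples-per-second idea, the sketch below times a prediction function over a batch of inputs. The function name examples_per_second and the toy model are hypothetical, and this is not Dynaboard's actual measurement code, which runs inside Facebook's evaluation cloud.

```python
import time
from typing import Callable, Sequence


def examples_per_second(predict: Callable[[str], object],
                        examples: Sequence[str]) -> float:
    """Return how many examples `predict` handles per second of wall-clock time.

    Hypothetical helper for illustration only.
    """
    if not examples:
        raise ValueError("need at least one example")
    start = time.perf_counter()
    for example in examples:
        predict(example)
    elapsed = time.perf_counter() - start
    # Guard against a zero reading on very fast toy models.
    return len(examples) / max(elapsed, 1e-9)


if __name__ == "__main__":
    toy_model = lambda text: "positive" if "good" in text else "negative"
    rate = examples_per_second(toy_model, ["good food", "bad service"] * 1000)
    print(f"{rate:.0f} examples/second")
```

A figure like this depends heavily on the hardware, which is presumably why the article notes that every model is measured on its instance in the same evaluation cloud.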

To calculate memory usage, Dynascore measures the total amount of memory a model requires, in gigabytes. Usage is averaged over the time the model is running, with measurements taken at a fixed interval of seconds.
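The averaging step can be sketched in a few lines. The function name average_memory_gb is hypothetical, the readings are assumed to be byte counts sampled while the model runs, and treating a gigabyte as 1024^3 bytes is also an assumption, since the article does not say which convention Dynaboard uses.

```python
from typing import Sequence

BYTES_PER_GB = 1024 ** 3  # assumption: binary gigabytes


def average_memory_gb(samples_bytes: Sequence[int]) -> float:
    """Average a series of periodic memory readings (in bytes) and return gigabytes.

    Hypothetical helper for illustration only.
    """
    if not samples_bytes:
        raise ValueError("need at least one memory sample")
    return sum(samples_bytes) / len(samples_bytes) / BYTES_PER_GB


if __name__ == "__main__":
    # Two readings: 1 GB early in the run, 3 GB at peak.
    print(average_memory_gb([1 * BYTES_PER_GB, 3 * BYTES_PER_GB]))
```

Averaging over the run rewards models whose memory use stays low throughout, whereas a peak-only measurement would ignore how long that peak lasts.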

Dynascore also measures robustness, or how well a model holds up against typographical errors and local paraphrases in its inputs. The platform measures how predictions change after adding "perturbations" to the examples, testing whether, for example, a model can capture that a "baaaad restaurant" is not a good restaurant.
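One plausible way to picture a perturbation test is to alter each example slightly and count how often the model's prediction stays the same. The sketch below does this with a single vowel-stretching perturbation. The names stretch_vowel and prediction_stability are hypothetical, and Dynaboard's real perturbation set and scoring may differ.

```python
import random
from typing import Callable, Sequence

VOWELS = "aeiou"


def stretch_vowel(text: str, rng: random.Random, repeat: int = 3) -> str:
    """Repeat one randomly chosen vowel `repeat` extra times, e.g. 'bad' -> 'baaaad'."""
    positions = [i for i, ch in enumerate(text) if ch.lower() in VOWELS]
    if not positions:
        return text
    i = rng.choice(positions)
    return text[:i] + text[i] * (repeat + 1) + text[i + 1:]


def prediction_stability(predict: Callable[[str], object],
                         examples: Sequence[str],
                         perturb: Callable[[str], str]) -> float:
    """Fraction of examples whose prediction is unchanged after perturbation."""
    if not examples:
        raise ValueError("need at least one example")
    unchanged = sum(predict(x) == predict(perturb(x)) for x in examples)
    return unchanged / len(examples)


if __name__ == "__main__":
    rng = random.Random(0)
    # A brittle keyword model: it only recognizes the exact string "bad".
    keyword_model = lambda t: "negative" if "bad" in t else "positive"
    reviews = ["bad restaurant", "good restaurant"]
    print(prediction_stability(keyword_model, reviews,
                               lambda t: stretch_vowel(t, rng)))
```

A keyword model like the one above is exactly the kind of system this test penalizes: stretching "bad" into "baaaad" hides the keyword and flips the prediction.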

Finally, Facebook claims Dynascore can evaluate a model's fairness with a test that perturbs dataset examples by, among other things, swapping the gender of noun phrases (e.g., replacing "sister" with "brother" or "he" with "they") and replacing names with others that are statistically predictive of another race or ethnicity. For the purposes of Dynaboard scoring, a model is considered "fair" if its predictions remain stable after these changes.
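The substitution test can be sketched with a small swap table and a stability check. The table below contains only the two examples quoted above, and the names SWAPS, swap_terms, and stability_under_swap are hypothetical. Dynaboard's actual substitution lists and scoring are not described in detail here, so treat this as an illustration of the idea rather than the platform's method.

```python
import re
from typing import Callable, Dict, Sequence

# Illustrative swap table built from the article's two examples.
SWAPS: Dict[str, str] = {"sister": "brother", "brother": "sister", "he": "they"}


def swap_terms(text: str, swaps: Dict[str, str] = SWAPS) -> str:
    """Replace whole words found in `swaps`, preserving a leading capital letter."""
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(word) for word in swaps) + r")\b",
        re.IGNORECASE,
    )

    def replace(match):
        word = match.group(0)
        new = swaps[word.lower()]
        return new.capitalize() if word[0].isupper() else new

    return pattern.sub(replace, text)


def stability_under_swap(predict: Callable[[str], object],
                         examples: Sequence[str]) -> float:
    """Fraction of examples whose prediction survives the substitution."""
    if not examples:
        raise ValueError("need at least one example")
    stable = sum(predict(x) == predict(swap_terms(x)) for x in examples)
    return stable / len(examples)


if __name__ == "__main__":
    # A deliberately biased toy model that keys on a gendered word.
    biased = lambda t: "positive" if "brother" in t.lower() else "negative"
    print(swap_terms("He thanked his sister"))
    print(stability_under_swap(biased, ["My sister is a doctor", "He arrived late"]))
```

Word-level swapping is simple to apply across a dataset, but it has no notion of grammar or context, so some substituted sentences will not mean quite what the original did.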

Facebook admits that this fairness metric isn't perfect. Replacing "his" with "hers" or "her" might make sense in English, for example, but can sometimes result in contextual mistakes. If Dynaboard were to replace "his" with "her" in the sentence "this cat is his," the result would be "this cat is her," which doesn't maintain the original meaning.

"At the launch of Dynaboard, we're starting off with an initial metric relevant to NLP tasks that we hope serves as a starting point for collaboration with the broader AI community," Facebook wrote. "Because the initial metric leaves room for improvement, we hope that the AI community will build on Dynaboard's ... platform and make progress on devising better metrics for specific contexts for evaluating relevant dimensions of fairness in the future."

Calculating a score

To combine the disparate metrics into a single score that can be used to rank models in Dynabench, Dynaboard finds an "exchange rate" between metrics that standardizes their units. The platform then takes a weighted average to calculate the Dynascore, so models can be dynamically re-ranked in real time as the weights are adjusted.

To compute the rate at which the adjustments are made, Dynaboard uses a formula called the "marginal rate of substitution" (MRS), which in economics is the amount of one good a consumer is willing to give up for another good while getting the same utility. Arriving at the default Dynascore involves estimating the average rate at which users are willing to trade off each metric for a one-point gain in performance and using that to convert all metrics into units of performance.
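The two steps described above, converting each metric into performance points and then taking a weighted average, can be sketched as follows. All numbers, the exchange rates, and the names dynascore and rank_models are made up for illustration. In particular, the sketch takes the rates as given rather than estimating them the way Dynaboard does, and it assumes that metrics where lower is better (such as memory) count against the score.

```python
from typing import Dict, Iterable, List

Metrics = Dict[str, float]


def dynascore(metrics: Metrics, weights: Metrics, rates: Metrics,
              lower_is_better: Iterable[str] = ("memory",)) -> float:
    """Weighted average of metrics after converting each to performance points.

    `rates[name]` is the assumed number of units of that metric judged to be
    worth one point of performance (a stand-in for the MRS-based exchange rate).
    """
    costs = set(lower_is_better)
    total_weight = sum(weights.values())
    score = 0.0
    for name, weight in weights.items():
        points = metrics[name] / rates[name]
        if name in costs:
            points = -points  # assumption: costs subtract from the score
        score += weight * points
    return score / total_weight


def rank_models(models: Dict[str, Metrics], weights: Metrics,
                rates: Metrics) -> List[str]:
    """Return model names ordered from highest to lowest score."""
    return sorted(models, key=lambda n: dynascore(models[n], weights, rates),
                  reverse=True)


if __name__ == "__main__":
    rates = {"accuracy": 1.0, "throughput": 2.0, "memory": 0.5}
    models = {
        "A": {"accuracy": 80.0, "throughput": 10.0, "memory": 2.0},
        "B": {"accuracy": 78.0, "throughput": 30.0, "memory": 2.0},
    }
    balanced = {"accuracy": 4, "throughput": 1, "memory": 1}
    accuracy_heavy = {"accuracy": 10, "throughput": 1, "memory": 1}
    print(rank_models(models, balanced, rates))        # B's speed wins
    print(rank_models(models, accuracy_heavy, rates))  # A's accuracy wins
```

Because the conversion to common units happens before weighting, changing the weights only requires recomputing a cheap sum, which is consistent with the article's point that leaderboards can be re-ranked on the fly.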

Dynaboard is available for researchers to submit their own model for evaluation via a new command-line interface tool and library called Dynalab. In the future, the company plans to open Dynabench up so anyone can run their own task or models in the loop for data collection while hosting their own dynamic leaderboards.

"The goal of the platform is to help show the world what state-of-the-art NLP models can achieve today, how much work we have yet to do, and in doing so, help bring about the next revolution in AI research," Facebook continued. "We hope Dynabench will help the AI community build systems that make fewer mistakes, are less subject to potentially harmful biases, and are more useful and beneficial to people in the real world."
