Facebook's Dynabench aims to make AI models more robust through distributed human workers

Facebook today introduced Dynabench, a platform for AI data collection and benchmarking that uses humans and models "in the loop" to create challenging test data sets. Leveraging a technique called dynamic adversarial data collection, Dynabench measures how easily humans can fool AI, which Facebook believes is a better indicator of a model's quality than current benchmarks provide.

A number of studies imply that commonly used benchmarks do a poor job of estimating real-world AI performance. One recent report found that 60%-70% of answers given by natural language processing (NLP) models were embedded somewhere in the benchmark training sets, indicating that the models were often simply memorizing answers. Another study -- a meta-analysis of over 3,000 AI papers -- found that metrics used to benchmark AI and machine learning models tended to be inconsistent, irregularly tracked, and not particularly informative.
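To make the memorization concern concrete, here is a minimal sketch of one simple way to measure answer overlap between a training set and a test set. It is not the method used in the report cited above; the function name `answer_overlap`, the normalization rule, and the toy data are all assumptions made for illustration.

```python
def answer_overlap(train_answers, test_answers):
    """Return the fraction of test answers that also appear in the training answers."""
    def normalize(text):
        # Assumed normalization: lowercase and collapse runs of whitespace.
        return " ".join(text.lower().split())

    seen = {normalize(answer) for answer in train_answers}
    if not test_answers:
        return 0.0
    hits = sum(1 for answer in test_answers if normalize(answer) in seen)
    return hits / len(test_answers)


# Toy data, invented for this example.
train = ["Paris", "Albert Einstein", "1969", "Pacific Ocean"]
test = ["paris", "Marie Curie", "1969", "Albert  Einstein", "Nile"]

overlap = answer_overlap(train, test)
print(f"{overlap:.0%} of test answers also appear in the training data")
```

A high overlap does not prove a model is memorizing, but it does mean a high benchmark score cannot rule memorization out.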

Facebook's attempt to rectify this was seemingly inspired by the Turing test, a test of a machine's ability to exhibit behavior equivalent to (or indistinguishable from) that of a human. As users employ Dynabench to gauge the performance of their models, the platform tracks which examples fool the models and lead to incorrect predictions. These examples improve the systems and become part of more challenging data sets that train the next generation of models, which can in turn be benchmarked with Dynabench to create a "virtuous cycle" of research progress. At least in theory.

"Dynabench is in essence a scientific experiment to see whether the AI research community can better measure our systems' capabilities and make faster progress," Facebook researchers Douwe Kiela and Adina Williams explained in a blog post. "We are launching Dynabench with four well-known tasks from NLP. We plan to open Dynabench up to the world for all kinds of tasks, languages, and modalities. We hope to spur 'model hackers' to come up with interesting new examples that models get wrong, and spur 'model builders' to build new models that have fewer weaknesses."

Facebook isn't the first to propose a crowd-focused approach to model development. In 2017, the Computational Linguistics and Information Processing Laboratory at the University of Maryland launched a platform dubbed Break It, Build It, which let researchers submit models to users tasked with coming up with examples to defeat them. A 2019 paper described a setup where trivia enthusiasts were instructed to craft questions validated via live human-computer matches. And more recently, researchers at University College London explored the effect of training AI models on "adversarially collected," human-prepared data sets.

Facebook itself has toyed with the idea of leveraging human-in-the-loop AI training and benchmarking. The groundwork for Dynabench might lie in a paper published by Facebook AI researchers in 2018, in which the coauthors suggest using gamification to motivate users to train better models while collaborating with each other. This foundational work helped improve Facebook's detection of offensive language and led to the release of a data set -- Adversarial Natural Language Inference -- built by having annotators fool models on inference tasks. Moreover, the 2018 study likely informed the development of Facebook's recently piloted text-based fantasy role-playing game, which iterates between collecting data from volunteers and retraining models on the collected data, enabling researchers to obtain data at one-fifth the price per utterance of crowdsourcing.

"We find this exciting because this approach shows it is possible to build continually improving models that learn from interacting with humans in the wild (as opposed to experiments with paid crowdworkers)," the coauthors of a paper describing the text-based game wrote, referring to the practice of paying crowdworkers through platforms like Amazon Mechanical Turk to perform AI training and benchmarking tasks. "This represents a paradigm shift away from the limited static dataset setup that is prevalent in much of the work of the community."

In Dynabench, benchmarking happens in the cloud over multiple rounds via TorchServe, a model-serving tool, and Captum, an interpretability library, both built for Facebook's PyTorch machine learning framework. During each round, a researcher or engineer selects one or more models to serve as the target to be tested. Dynabench collects examples using these models and periodically releases updated data sets to the community. When new state-of-the-art models catch most or all of the examples that fooled the previous models, a new round can be started with these better models in the loop.
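The round structure described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Dynabench's actual code: the function names, the toy sentiment models, and the 10% error threshold for starting a new round are all hypothetical.

```python
def collect_fooling_examples(model, attempts):
    """Keep the human-written examples that the target model gets wrong."""
    return [(text, label) for text, label in attempts if model(text) != label]


def error_rate(model, examples):
    """Fraction of examples the model labels incorrectly."""
    if not examples:
        return 0.0
    wrong = sum(1 for text, label in examples if model(text) != label)
    return wrong / len(examples)


def ready_for_next_round(new_model, fooling_examples, max_error=0.1):
    """Assumed rule: a new round can start once a newer model handles
    most of the examples that fooled the previous target model."""
    return error_rate(new_model, fooling_examples) <= max_error


# Toy target model: calls anything containing "good" positive.
def keyword_model(text):
    return "positive" if "good" in text.lower() else "negative"


# Toy successor model: handles one negation and one extra keyword.
def negation_aware_model(text):
    lowered = text.lower()
    if "not good" in lowered:
        return "negative"
    return "positive" if ("good" in lowered or "great" in lowered) else "negative"


# Annotator attempts, invented for this example.
attempts = [
    ("This film is good", "positive"),
    ("This film is not good", "negative"),
    ("A great surprise", "positive"),
    ("Dull and slow", "negative"),
]

round_one = collect_fooling_examples(keyword_model, attempts)
print(len(round_one), "examples fooled the first model")
print("Start next round:", ready_for_next_round(negation_aware_model, round_one))
```

The fooling examples from each round double as a harder test set for the next generation of models, which is the cycle the platform is designed to drive.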

Crowdsourced annotators connect to Dynabench using Mephisto, a platform for launching, monitoring, and reviewing crowdsourced data science workloads. They receive feedback on a given model's response nearly instantaneously, enabling them to employ tactics like making the model focus on the wrong word or attempt to answer questions requiring extensive real-world knowledge.

Facebook says that all examples on Dynabench are validated by other annotators, and that if these annotators don't agree with the original label, the example is discarded. If the example is offensive or there's something else wrong with it, annotators can flag the example, which will trigger an expert review. (Facebook says it hired a dedicated linguist for this purpose.)
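As a rough sketch, the validation rule described above might look like the following. The article does not say how many validators review each example or whether agreement must be unanimous, so the unanimity requirement, the function name `review_example`, and the status strings are assumptions.

```python
def review_example(original_label, validator_labels, flagged=False):
    """Decide what happens to a submitted example.

    Assumptions for this sketch: a flag always routes the example to an
    expert, and a single disagreeing validator is enough to discard it.
    """
    if flagged:
        return "expert_review"
    if not validator_labels:
        return "pending"  # no one has validated the example yet
    if any(label != original_label for label in validator_labels):
        return "discard"
    return "keep"


print(review_example("entailment", ["entailment", "entailment"]))
print(review_example("entailment", ["entailment", "neutral"]))
print(review_example("entailment", ["entailment"], flagged=True))
```

A majority-vote rule would be an equally plausible reading of the description; it would only change the `any(...)` check.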

The first iteration of Dynabench focuses on four core tasks -- natural language inference, question-answering, sentiment analysis, and hate speech detection -- in the English NLP domain, which Kiela and Williams say suffers most from rapid benchmark "saturation." (While it took the research community about 18 years to achieve human-level performance on the computer vision benchmark MNIST and about six years to surpass humans on ImageNet, models beat humans on the GLUE benchmark for language understanding after only a year.) Facebook partnered with researchers at academic institutions including the University of North Carolina at Chapel Hill, University College London, and Stanford to identify, develop, and maintain the tasks in Dynabench, and the company says it will use funding to encourage people to annotate tasks -- a critical step in the benchmarking process.

Kiela and Williams assert that because the process can be frequently repeated, Dynabench can be used to identify biases and create examples that test whether the model has overcome them. They also contend that Dynabench makes models more robust to vulnerabilities and other weaknesses, because human annotators can generate lots of examples in an effort to fool them.

"Ultimately, this metric will better reflect the performance of AI models in the circumstances that matter most: when interacting with people, who behave and react in complex, changing ways that can't be reflected in a fixed set of data points," they wrote. "Dynabench can challenge it in ways that a static test can't. For example, a college student might try to ace an exam by just memorizing a large set of facts. But that strategy wouldn't work in an oral exam, where the student must display true understanding when asked probing, unanticipated questions."

The extent to which Dynabench mitigates model bias remains to be seen, particularly given Facebook's poor track record in this regard. A recent New York Times report found evidence that Facebook's recommendation algorithm encouraged the growth of QAnon, a loosely affiliated group alleging that a cabal of pedophiles is plotting against President Donald Trump. A separate investigation revealed that on Instagram in the U.S. in 2019, Black users were about 50% more likely to have their accounts disabled by automated moderation systems than those whose activity indicated they were white. In January, Seattle University associate professor Caitlin Ring Carlson published results from an experiment in which she and a colleague collected more than 300 posts that appeared to violate Facebook's hate speech rules and reported them via the service's tools; only about half of the posts were ultimately removed. And in May, owing to a bug that was later fixed, Facebook's automated system threatened to ban the organizers of a group working to hand-sew masks from commenting or posting on the platform, informing them that the group could be deleted altogether.

Facebook says that while Dynabench doesn't currently provide any tools for bias mitigation, a future version might as the research matures. "Measuring bias is still an open question in the research community," a Facebook spokesperson told VentureBeat via email. "As a research community, we need to figure out what kind of biases we don't want models to have, and actively mitigate these ... With Dynabench, annotators try to exploit weaknesses in models, and if a model has unwanted biases, annotators will be able to exploit those to create examples that fool the model. Those examples then become part of the data set, and should enable researchers' efforts to mitigate unwanted biases."

That's putting aside the fact that the crowdsourcing model can be problematic in its own right. Last year, Wired reported on the susceptibility of platforms like Amazon Mechanical Turk to automated bots. Even when the workers are verifiably human, they're motivated by pay rather than interest, which can result in low-quality data -- particularly when they're treated poorly and paid a below-market rate. Researchers including Niloufar Salehi have made attempts at tackling Amazon Mechanical Turk's flaws with efforts like Dynamo, an open-access worker collective, but there's only so much they can do.

For its part, Facebook says the open nature of Dynabench will enable it to avoid common crowdsourcing pitfalls. The company plans to let anyone create their own tasks in a range of different languages, and to compensate some annotators for the work they contribute.

"Dynabench allows anyone to volunteer to be an annotator and create examples to challenge models," the spokesperson said. "We also plan to supplement those volunteer efforts with paid annotators, particularly for tasks that will benefit from experts; we will fairly compensate those annotators (as we do for AI research projects on other crowdsourcing platforms), and they will receive a further bonus if they successfully create examples that fool the models."

As for Kiela and Williams, they characterize Dynabench as a scientific experiment to accelerate progress in AI research. "We hope it will help show the world what state-of-the-art AI models can achieve today as well as how much work we have yet to do," they wrote.