Salesforce researchers release framework to test NLP model robustness

In the subfield of machine learning known as natural language processing (NLP), robustness testing is the exception rather than the norm. That's particularly problematic in light of work showing that many NLP models rely on spurious correlations that hurt their performance outside of specific tests. One report found that 60% to 70% of answers given by NLP models were embedded somewhere in the benchmark training sets, indicating that the models were usually simply memorizing answers. Another study, a meta-analysis of over 3,000 AI papers, found that metrics used to benchmark AI and machine learning models tended to be inconsistent, irregularly tracked, and not particularly informative.

This motivated Nazneen Rajani, a senior research scientist at Salesforce who leads the company's NLP group, to create an ecosystem for robustness evaluations of machine learning models. Together with Stanford associate professor of computer science Christopher Ré and University of North Carolina at Chapel Hill's Mohit Bansal, Rajani and the team developed Robustness Gym, which aims to unify the patchwork of existing robustness libraries to accelerate the development of novel NLP model testing strategies.

"Whereas existing robustness tools implement specific strategies such as adversarial attacks or template-based augmentations, Robustness Gym provides a one-stop-shop to run and compare a broad range of evaluation strategies," Rajani explained to VentureBeat via email. "We hope that Robustness Gym will make robustness testing a standard component in the machine learning pipeline."

Robustness Gym provides guidance to practitioners on how three key variables (their task, evaluation needs, and resource constraints) can help prioritize what evaluations to run. The suite characterizes a task by its structure and known prior evaluations; needs by goals such as testing generalization, fairness, or security; and constraints by factors like expertise, compute access, and human resources.

Robustness Gym casts all robustness tests into four evaluation "idioms": subpopulations, transformations, evaluation sets, and adversarial attacks. Practitioners can create what are called slices, where each slice defines a collection of examples for evaluation built using one or a combination of evaluation idioms. The toolkit guides users through a simple two-stage workflow that separates the storage of structured side information about examples from the nuts and bolts of programmatically building slices using this information.
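To make the two-stage workflow concrete, here is a minimal sketch in plain Python. It does not use Robustness Gym's actual API; the names `Example`, `cache_length`, `subpopulation`, and `transformation` are hypothetical and exist only for illustration. The first stage stores side information on each example, and the second stage builds slices from it using two of the four idioms.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Example:
    """A labeled example plus a dictionary of cached side information."""
    text: str
    label: int
    info: Dict[str, object] = field(default_factory=dict)


# Stage 1: compute structured side information once and store it.
def cache_length(examples: List[Example]) -> None:
    for ex in examples:
        ex.info["num_tokens"] = len(ex.text.split())


# Stage 2: build slices programmatically from the stored information.
def subpopulation(examples: List[Example],
                  predicate: Callable[[Example], bool]) -> List[Example]:
    """Subpopulation idiom: keep only examples that satisfy a predicate."""
    return [ex for ex in examples if predicate(ex)]


def transformation(examples: List[Example],
                   edit: Callable[[str], str]) -> List[Example]:
    """Transformation idiom: apply a text edit while keeping each label."""
    return [Example(edit(ex.text), ex.label, dict(ex.info)) for ex in examples]


# Toy data, made up for this sketch.
examples = [
    Example("A quiet , moving film", 1),
    Example("Dull", 0),
    Example("Not once did the plot surprise me in any way", 0),
]
cache_length(examples)

short = subpopulation(examples, lambda ex: ex.info["num_tokens"] <= 5)
lowercased = transformation(examples, str.lower)
short_lowercased = transformation(short, str.lower)  # two idioms combined
```

Keeping the cached information separate from the slice definitions means a new slice, such as a different length threshold, can be defined without recomputing anything about the examples themselves.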

Robustness Gym also consolidates slices and findings for prototyping, iterating, and collaborating. Practitioners can organize slices into a test bench that can be versioned and shared, allowing a community of users to build benchmarks together and track progress. For reporting, Robustness Gym provides standard and custom robustness reports that can be auto-generated from test benches and included in paper appendices.
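The idea of a versioned test bench with an auto-generated report can be sketched as follows. This is an illustration under assumptions, not Robustness Gym's real interface: the `Bench` class, its methods, and the toy `keyword_model` are all hypothetical, and the data is invented.

```python
import json
from typing import Callable, Dict, List, Tuple

Slice = List[Tuple[str, int]]  # (text, gold label) pairs


class Bench:
    """Hypothetical test bench: a named, versioned collection of slices."""

    def __init__(self, name: str, version: str) -> None:
        self.name = name
        self.version = version
        self.slices: Dict[str, Slice] = {}

    def add_slice(self, name: str, examples: Slice) -> None:
        self.slices[name] = examples

    def evaluate(self, model: Callable[[str], int]) -> Dict[str, float]:
        """Return accuracy per slice."""
        return {
            name: sum(model(text) == gold for text, gold in examples) / len(examples)
            for name, examples in self.slices.items()
        }

    def report(self, model: Callable[[str], int]) -> str:
        """Render a plain-text table with one row per slice."""
        rows = [f"{self.name} v{self.version}", f"{'slice':<12}{'n':>4}{'acc':>8}"]
        for name, acc in self.evaluate(model).items():
            rows.append(f"{name:<12}{len(self.slices[name]):>4}{acc:>8.2f}")
        return "\n".join(rows)

    def to_json(self) -> str:
        """Serialize the bench so it can be stored under version control and shared."""
        return json.dumps(
            {"name": self.name, "version": self.version, "slices": self.slices}
        )


def keyword_model(text: str) -> int:
    """Toy case-sensitive classifier, used only to show a slice-level drop."""
    return 1 if "good" in text else 0


bench = Bench("sentiment-demo", "0.1")
bench.add_slice("original", [("good movie", 1), ("bad movie", 0)])
bench.add_slice("uppercased", [("GOOD MOVIE", 1), ("BAD MOVIE", 0)])

scores = bench.evaluate(keyword_model)
print(bench.report(keyword_model))
```

Because the bench is just named slices plus a version string, two teams evaluating different models against the same serialized bench would be comparing like with like.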

In a case study, Rajani and coauthors had a sentiment modeling team at a "major technology company" measure the bias of their model using subpopulations and transformations. After testing the system on 172 slices spanning three evaluation idioms, the modeling team found a performance degradation on 16 slices of up to 18%.

In a more revealing test, Rajani and team used Robustness Gym to compare commercial NLP APIs from Microsoft (Text Analytics API), Google (Cloud Natural Language API), and Amazon (Comprehend API) with the open source systems BOOTLEG, WAT, and REL across two benchmark datasets for named entity linking. (Named entity linking entails identifying mentions of entities in a text, like the names of people, places, and brands, and connecting each mention to the corresponding entry in a knowledge base.) They found that the commercial systems struggled to link rare or less-popular entities, were sensitive to entity capitalization, and often ignored contextual cues when making predictions. Microsoft outperformed other commercial systems, but BOOTLEG beat out the rest in terms of consistency.

"Both Google and Microsoft display strong performance on some topics, e.g. Google on 'alpine sports' and Microsoft on 'skating' ... [but] commercial systems sidestep the difficult problem of disambiguating ambiguous entities in favor of returning the more popular answer," Rajani and coauthors wrote in the paper describing their work. "Overall, our results suggest that state-of-the-art academic systems substantially outperform commercial APIs for named entity linking."

In a final experiment, Rajani's team implemented five subpopulations that capture summary abstractiveness, content distillation, positional bias, information dispersion, and information reordering. After comparing seven NLP models, including Google's T5 and Pegasus, on an open source summarization dataset across these subpopulations, the researchers found that the models struggled to perform well on examples that were highly distilled, required higher amounts of abstraction, or contained more references to entities. Surprisingly, models with different prediction mechanisms appeared to make "highly correlated" errors, suggesting that existing metrics can't capture meaningful performance differences.
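As a rough illustration of how a summarization subpopulation might be built, the sketch below scores each source/summary pair by the fraction of summary bigrams that do not appear in the source, then splits the pairs at a threshold. This is a common stand-in for abstraction and is an assumption here; the article does not give the researchers' exact definitions, and the function names, the 0.5 cutoff, and the toy data are all hypothetical.

```python
from typing import List, Set, Tuple


def ngrams(tokens: List[str], n: int) -> Set[Tuple[str, ...]]:
    """Return the set of n-grams in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def novel_ngram_fraction(source: str, summary: str, n: int = 2) -> float:
    """Fraction of summary n-grams absent from the source (0.0 = fully copied)."""
    src = ngrams(source.lower().split(), n)
    summ = ngrams(summary.lower().split(), n)
    if not summ:
        return 0.0
    return len(summ - src) / len(summ)


# Toy (source, summary) pairs, made up for this sketch.
pairs = [
    ("the cat sat on the mat near the door", "the cat sat on the mat"),
    ("the cat sat on the mat near the door", "a feline rested indoors"),
]

scores = [novel_ngram_fraction(src, summ) for src, summ in pairs]

# Two subpopulations split at an arbitrary threshold of 0.5.
extractive = [p for p, s in zip(pairs, scores) if s < 0.5]
abstractive = [p for p, s in zip(pairs, scores) if s >= 0.5]
```

Evaluating a model separately on the `extractive` and `abstractive` groups would then show whether its quality drops as summaries move away from copying the source.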

"Using Robustness Gym, we demonstrate that robustness remains a challenge even for corporate giants such as Google and Amazon," Rajani said. "Specifically, we show that public APIs from these companies perform significantly worse than simple string-matching algorithms for the task of entity disambiguation when evaluated on infrequent (tail) entities."

Both the aforementioned paper and Robustness Gym's source code are available as of today.
