AI researchers create testing tool to find bugs in NLP from Amazon, Google, and Microsoft

AI researchers have created a language-model testing tool that has discovered major bugs in commercially available cloud AI offerings from Amazon, Google, and Microsoft. Yesterday, a paper detailing the CheckList tool received the Best Paper award from organizers of the Association for Computational Linguistics (ACL) conference. The ACL conference, which took place online this week, is one of the largest annual gatherings for researchers creating language models.

NLP models today are often evaluated on how they perform on individual tasks, such as question answering, using benchmark data sets with leaderboards like GLUE. CheckList instead takes a task-agnostic approach, allowing people to create tests that fill in cells in a spreadsheet-like matrix with capabilities (in rows) and test types (in columns), along with visualizations and other resources.
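One way to picture that matrix is as a mapping from (capability, test type) cells to failure rates. The Python sketch below is only an illustration of the idea, not CheckList's own code: the function names `failure_matrix` and `render`, the sample results, and the short test-type labels are all assumptions made for this example.

```python
from collections import defaultdict

def failure_matrix(results):
    """Aggregate (capability, test_type, passed) tuples into per-cell failure rates."""
    counts = defaultdict(lambda: [0, 0])  # cell -> [failures, total]
    for capability, test_type, passed in results:
        cell = counts[(capability, test_type)]
        cell[0] += 0 if passed else 1
        cell[1] += 1
    return {cell: fails / total for cell, (fails, total) in counts.items()}

def render(matrix):
    """Lay the cells out with capabilities in rows and test types in columns."""
    rows = sorted({capability for capability, _ in matrix})
    cols = sorted({test_type for _, test_type in matrix})
    lines = ["\t".join(["capability"] + cols)]
    for row in rows:
        cells = [f"{matrix[(row, col)]:.0%}" if (row, col) in matrix else "-"
                 for col in cols]
        lines.append("\t".join([row] + cells))
    return "\n".join(lines)

# Made-up results, purely for illustration (not figures from the paper).
SAMPLE_RESULTS = [
    ("negation", "MFT", False),
    ("negation", "MFT", True),
    ("robustness", "INV", True),
    ("fairness", "INV", False),
]

if __name__ == "__main__":
    print(render(failure_matrix(SAMPLE_RESULTS)))
```

Empty cells print as "-", which makes it easy to see which capability and test-type combinations nobody has written a test for yet.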

Analysis with CheckList found that about one in four sentiment analysis predictions by Amazon's Comprehend change when a random shortened URL or Twitter handle is placed in text, and that Google Cloud's Natural Language and Amazon's Comprehend make mistakes when the names of people or locations are changed in text.
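The URL and handle finding is an example of an invariance check: apply a change that should not affect the label, then count how often the prediction changes anyway. A minimal Python sketch of that idea follows. The names `toy_sentiment`, `add_noise_token`, and `invariance_failure_rate` are hypothetical, the noise tokens are made up, and the toy classifier is deliberately brittle so the check has something to catch; none of this is CheckList's API or any vendor's model.

```python
import random

# Made-up noise tokens standing in for a shortened URL and a Twitter handle.
NOISE_TOKENS = ["https://t.co/abc123", "@some_user"]

def toy_sentiment(text):
    """Hypothetical stand-in for a sentiment model, with a planted bug."""
    lowered = text.lower()
    if "@" in lowered:  # planted bug: any handle forces a neutral prediction
        return "neutral"
    if any(word in lowered for word in ("great", "love", "good")):
        return "positive"
    if any(word in lowered for word in ("awful", "hate", "bad")):
        return "negative"
    return "neutral"

def add_noise_token(text, rng):
    """Append a random token that should be irrelevant to sentiment."""
    return f"{text} {rng.choice(NOISE_TOKENS)}"

def invariance_failure_rate(predict, texts, perturb, seed=0):
    """Share of texts whose prediction changes after the perturbation."""
    if not texts:
        raise ValueError("need at least one text")
    rng = random.Random(seed)
    changed = sum(predict(t) != predict(perturb(t, rng)) for t in texts)
    return changed / len(texts)

TEXTS = ["I love this airline.", "The food was awful.", "The flight was great."]

if __name__ == "__main__":
    rate = invariance_failure_rate(toy_sentiment, TEXTS, add_noise_token)
    print(f"predictions changed for {rate:.0%} of inputs")
```

Passing the model in as a plain `predict` callable keeps the check black-box: it needs only inputs and outputs, so the same function could wrap a call to a hosted service.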

"The [sentiment analysis] failure rate is near 100% for all commercial models when the negation comes at the end of the sentence (e.g. 'I thought the plane would be awful, but it wasn't'), or with neutral content between the negation and the sentiment-laden word," the paper reads.
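A negation check like the one quoted above can be built from a sentence template whose expected label is known in advance. The Python sketch below shows one hypothetical way to do that. The template wording beyond the quoted example, the word lists, the naive `keyword_model`, and the choice to accept either "positive" or "neutral" as a correct answer are all assumptions of this sketch.

```python
from itertools import product

# Template modeled on the example sentence quoted in the paper excerpt.
TEMPLATE = "I thought the {noun} would be {adj}, but it {negation}."
NOUNS = ["plane", "food", "crew"]
NEGATIVE_ADJECTIVES = ["awful", "terrible"]
NEGATIONS = ["wasn't", "was not"]

def expand_template():
    """Fill the template with every combination of the word lists."""
    return [
        TEMPLATE.format(noun=noun, adj=adj, negation=negation)
        for noun, adj, negation in product(NOUNS, NEGATIVE_ADJECTIVES, NEGATIONS)
    ]

def keyword_model(text):
    """Hypothetical naive model that reacts to keywords and ignores negation."""
    lowered = text.lower()
    if any(adj in lowered for adj in NEGATIVE_ADJECTIVES):
        return "negative"
    return "neutral"

def failure_rate(predict, cases, acceptable):
    """Return the failure rate and the failing cases for a set of test sentences."""
    failures = [case for case in cases if predict(case) not in acceptable]
    return len(failures) / len(cases), failures

if __name__ == "__main__":
    cases = expand_template()
    rate, failures = failure_rate(keyword_model, cases, {"positive", "neutral"})
    print(f"{len(failures)} of {len(cases)} cases failed ({rate:.0%})")
```

Because every generated sentence negates a negative expectation, a model that keys only on the sentiment-laden word fails all of them, which is the kind of systematic gap an aggregate benchmark score can hide.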

CheckList also found shortcomings in how models handle paraphrased Quora questions, even though those models surpass human accuracy in a Quora Question Pair benchmark challenge. The creators of CheckList, from Microsoft, the University of Washington, and the University of California, Irvine, say the results indicate that the approach can improve any existing NLP model.

"While traditional benchmarks indicate that models on these tasks are as accurate as humans, CheckList reveals a variety of severe bugs, where commercial and research models do not effectively handle basic linguistic phenomena such as negation, named entities, coreferences, semantic role labeling, etc, as they pertain to each task," the paper reads. "NLP practitioners with CheckList created twice as many tests, and found almost three times as many bugs as users without it."

Google's BERT and Facebook AI's RoBERTa were also evaluated using CheckList. The authors said BERT exhibited gender bias in machine comprehension, for example by overwhelmingly predicting men as doctors. BERT was also found to always make positive predictions about people who are straight or Asian and negative predictions when dealing with text about people who are atheist, Black, gay, or lesbian. An analysis in early 2020 also found systemic bias among large-scale language models.

In recent months, some of the largest Transformer-based language models to date have been introduced, from Nvidia's Megatron to Microsoft's Turing NLG. Large language models have racked up impressive scores on particular tasks. But some NLP researchers argue that a focus on human-level performance on individual tasks ignores the ways in which NLP systems are still brittle.

In a use case test with the Microsoft team in charge of Text Analytics, a model that is currently in use by customers and has gone through multiple evaluations, CheckList found previously unknown bugs. The Microsoft team will now use CheckList as part of its workflow when evaluating NLP systems. A group of people from industry and academia who tested AI with the tool for two hours were also able to discover inaccuracies or bugs in state-of-the-art NLP models. An open source version of CheckList is currently available on GitHub.

Sometimes referred to as black box testing, behavioral testing is an approach common in software engineering but not in AI. CheckList can run tests for tasks like sentiment analysis, machine comprehension, and duplicate question detection. It can also probe capabilities like robustness, fairness, and logic across those three kinds of tasks.

The authors are unequivocal in their conclusion that benchmark tasks alone are not sufficient for evaluating NLP models, but they also say that CheckList should complement, not replace, existing challenges and benchmark data sets used for measuring performance of language models.

"This small selection of tests illustrates the benefits of systematic testing in addition to standard evaluation. These tasks may be considered 'solved' based on benchmark accuracy results, but the tests highlight various areas of improvement -- in particular, failure to demonstrate basic skills that are de facto needs for the task at hand," the paper reads.

Other noteworthy work at ACL includes research by University of Washington professor Emily Bender and Saarland University professor Alexander Koller that won the best theme award. The paper argues that progress on large neural network NLP models such as GPT-3 or BERT derivatives is laudable, but that members of the media and academia should not refer to large neural networks as capable of understanding or comprehension, and that clarity and humility are needed in the NLP field when defining ideas like meaning or understanding.

"While large neural language models may well end up being important components of an eventual full-scale solution to human-analogous natural language understanding, they are not nearly-there solutions to this grand challenge," the report reads.

Finally, a team from the U.S. Army Research Lab, the University of Illinois at Urbana-Champaign, and Columbia University won the Best Demo paper award for GAIA, a system that allows for text queries of multimedia like photos and videos.
