AI models from Microsoft and Google already surpass human performance on the SuperGLUE language benchmark

In late 2019, researchers affiliated with Facebook, New York University (NYU), the University of Washington, and DeepMind proposed SuperGLUE, a new benchmark for AI designed to summarize research progress on a diverse set of language tasks. Building on the GLUE benchmark, which had been introduced one year prior, SuperGLUE includes a set of more difficult language understanding challenges, improved resources, and a publicly available leaderboard.

When SuperGLUE was introduced, there was a nearly 20-point gap between the best-performing model and human performance on the leaderboard. But as of early January, two models -- one from Microsoft called DeBERTa and a second from Google called T5 + Meena -- have surpassed the human baselines, becoming the first to do so.

Sam Bowman, assistant professor at NYU's Center for Data Science, said the achievement reflected innovations in machine learning including self-supervised learning, where models learn from unlabeled datasets with recipes for adapting the insights to target tasks. "These datasets reflect some of the hardest supervised language understanding task datasets that were freely available two years ago," he said. "There's no reason to believe that SuperGLUE will be able to detect further progress in natural language processing, at least beyond a small remaining margin."

But SuperGLUE isn't a perfect -- nor a complete -- test of human language ability. In a blog post, the Microsoft team behind DeBERTa themselves noted that their model is "by no means" reaching the human-level intelligence of natural language understanding. They say this will require research breakthroughs, along with new benchmarks to measure them and their effects.

SuperGLUE

As the researchers wrote in the paper introducing SuperGLUE, their benchmark is intended to be a simple, hard-to-game measure of advances toward general-purpose language understanding technologies for English. It comprises eight language understanding tasks drawn from existing data and accompanied by a performance metric as well as an analysis toolkit.

The tasks are:

Boolean Questions (BoolQ) requires models to respond to a question about a short passage from a Wikipedia article that contains the answer. The questions come from Google users, who submit them via Google Search.
CommitmentBank (CB) tasks models with identifying a hypothesis contained within a text excerpt from sources including the Wall Street Journal and determining whether the hypothesis holds true.
Choice of Plausible Alternatives (COPA) provides a premise sentence drawn from blogs and a photography-related encyclopedia; models must determine its cause or its effect from two possible choices.
Multi-Sentence Reading Comprehension (MultiRC) is a question-answer task where each example consists of a context paragraph, a question about that paragraph, and a list of possible answers. A model must predict which answers are true and false.
Reading Comprehension with Commonsense Reasoning Dataset (ReCoRD) has models predict masked-out words and phrases from a list of choices in passages from CNN and the Daily Mail, where the same words or phrases might be expressed using multiple different forms, all of which are considered correct.
Recognizing Textual Entailment (RTE) challenges natural language models to identify whether the truth of one text excerpt follows from another text excerpt.
Word-in-Context (WiC) provides models two text snippets and a polysemous word (i.e., word with multiple meanings) and requires them to determine whether the word is used with the same sense in both sentences.
Winograd Schema Challenge (WSC) is a task where models, given passages from fiction books, must answer multiple-choice questions about the antecedent of ambiguous pronouns. It's designed to be an improvement on the Turing Test.
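
To make the benchmark's single performance metric concrete, here is a minimal sketch of how a leaderboard-style overall score can be computed: average the metrics within each task, then weight every task equally. The per-task numbers below are hypothetical placeholders, not real leaderboard results, and which tasks report two metrics is an assumption made for illustration.

```python
from statistics import mean

# Hypothetical per-task metric values (NOT real leaderboard numbers).
# Tasks assumed to report two metrics list both; the rest list one.
task_metrics = {
    "BoolQ": [80.0],
    "CB": [90.0, 80.0],
    "COPA": [70.0],
    "MultiRC": [60.0, 40.0],
    "ReCoRD": [75.0, 73.0],
    "RTE": [70.0],
    "WiC": [65.0],
    "WSC": [64.0],
}

def overall_score(metrics_by_task):
    """Average metrics within each task, then average the task scores equally."""
    task_scores = [mean(values) for values in metrics_by_task.values()]
    return mean(task_scores)

score = overall_score(task_metrics)
print(f"overall score: {score:.2f}")
```

Because every task counts equally, a large gain on one small dataset moves the headline number as much as the same gain on a large one.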

SuperGLUE also attempts to measure gender bias in models with Winogender Schemas, pairs of sentences that differ only by the gender of one pronoun in the sentence. However, the researchers note that this measure has limitations in that it offers only positive predictive value: While a poor bias score is clear evidence that a model exhibits gender bias, a good score doesn't mean the model is unbiased. Moreover, it doesn't include all forms of gender or social bias, making it a coarse measure of prejudice.
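
One way to summarize results on such sentence pairs is a parity score: the share of pairs for which a model gives the same answer regardless of the pronoun's gender. The sketch below assumes that definition. The sentences and the deliberately biased `toy_predict` function are hypothetical stand-ins, not real Winogender data or a real model.

```python
def gender_parity(pairs, predict):
    """Percentage of minimal pairs that receive the same prediction."""
    same = sum(1 for a, b in pairs if predict(a) == predict(b))
    return 100.0 * same / len(pairs)

# Hypothetical minimal pairs: identical except for one pronoun.
pairs = [
    ("The technician told the customer that he had finished the repair.",
     "The technician told the customer that she had finished the repair."),
    ("The nurse told the patient that he would return soon.",
     "The nurse told the patient that she would return soon."),
]

def toy_predict(sentence):
    """Deliberately biased stand-in for a model's pronoun resolution.

    It links the pronoun to the occupation, except that it refuses to link
    'he' to 'nurse' and picks the other participant instead.
    """
    words = sentence.lower().split()
    if "nurse" in words and "he" in words:
        return "participant"
    return "occupation"

parity = gender_parity(pairs, toy_predict)
print(f"gender parity: {parity:.1f}%")
```

The toy model changes its answer on the second pair, so it scores 50%. As the researchers caution, a score of 100% on a test like this would not show that a model is free of bias.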

To establish human performance baselines, the researchers drew on existing literature for WiC, MultiRC, RTE, and ReCoRD and, for the remaining tasks, hired crowdworker annotators through Amazon's Mechanical Turk platform. Each worker, paid an average of $23.75 an hour, completed a short training phase before annotating up to 30 samples of selected test sets using instructions and an FAQ page.

Architectural improvements

The Google team hasn't yet detailed the improvements that led to its model's record-setting performance on SuperGLUE, but the Microsoft researchers behind DeBERTa detailed their work in a blog post published earlier this morning. DeBERTa isn't new -- it was open-sourced last year -- but the researchers say they trained a larger version with 1.5 billion parameters (i.e., the internal variables that the model uses to make predictions). It'll be released in open source and integrated into the next version of Microsoft's Turing natural language representation model, which supports products like Bing, Office, Dynamics, and Azure Cognitive Services.

DeBERTa is pretrained through masked language modeling (MLM), a fill-in-the-blank task where a model is taught to use the words surrounding a masked "token" to predict what the masked word should be. DeBERTa uses both the content and position information of context words for MLM, such that it's able to recognize that "store" and "mall" in the sentence "a new store opened beside the new mall" play different syntactic roles, for example.
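
The fill-in-the-blank setup can be sketched in a few lines. This is an illustration only: it assumes whitespace tokenization and hand-picked mask positions, whereas real systems use subword tokenizers and choose positions at random. The `mask_tokens` helper is a hypothetical name, not part of any library.

```python
MASK = "[MASK]"

def mask_tokens(tokens, positions):
    """Return (masked_tokens, labels) for one fill-in-the-blank example."""
    masked = list(tokens)
    labels = {}
    for i in positions:
        labels[i] = masked[i]   # the word the model must recover
        masked[i] = MASK        # what the model actually sees
    return masked, labels

tokens = "a new store opened beside the new mall".split()
masked, labels = mask_tokens(tokens, positions=[2, 7])

print(" ".join(masked))
print(labels)
```

With both "store" and "mall" hidden, the surrounding words are identical for the two blanks ("new" precedes each), so only position information can tell the model which word belongs where.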

Unlike some other models, DeBERTa accounts for words' absolute positions in the language modeling process. Moreover, it computes the parameters within the model that transform input data and measure the strength of word-word dependencies based on words' relative positions. For example, DeBERTa would understand that the dependency between the words "deep" and "learning" is much stronger when they occur next to each other than when they occur in different sentences.
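
A common way to give a model access to relative positions is to map the signed distance between two tokens into a small number of buckets, with far-apart pairs sharing a bucket. The sketch below assumes that scheme with a window size `k`. It is meant to illustrate the idea and is not necessarily DeBERTa's exact implementation.

```python
def relative_position_index(i, j, k):
    """Map the signed offset i - j into one of 2k buckets, clipping distant pairs."""
    offset = i - j
    if offset <= -k:
        return 0
    if offset >= k:
        return 2 * k - 1
    return offset + k

tokens = ["deep", "learning", "is", "fun"]
k = 2  # hypothetical window size; real models use far larger values
matrix = [
    [relative_position_index(i, j, k) for j in range(len(tokens))]
    for i in range(len(tokens))
]

for token, row in zip(tokens, matrix):
    print(f"{token:>8}: {row}")
```

A model would then learn one embedding per bucket. Adjacent words such as "deep" and "learning" always land in the same pair of buckets wherever they appear in the text, which lets the model learn that nearby words tend to depend on each other more strongly.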

DeBERTa also benefits from adversarial training, a technique that leverages adversarial examples derived from small variations made to training data. These adversarial examples are fed to the model during the training process, improving its generalizability.
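
As a rough illustration of what "small variations" means, the sketch below nudges each input feature of a toy linear classifier by a small amount in the direction that increases its loss. This is a generic gradient-sign perturbation, not the specific method used to train DeBERTa. The weights, example, and `epsilon` value are all hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss(w, x, y):
    """Binary cross-entropy of a linear classifier on one example."""
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def adversarial_example(w, x, y, epsilon):
    """Shift each feature by epsilon in the direction that raises the loss."""
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
    grad_x = [(p - y) * wi for wi in w]  # d(loss)/d(x) for this model
    sign = lambda g: (g > 0) - (g < 0)
    return [xi + epsilon * sign(g) for xi, g in zip(x, grad_x)]

w = [0.5, -1.0, 2.0]        # hypothetical model weights
x, y = [1.0, 0.0, 1.0], 1   # hypothetical training example and label
x_adv = adversarial_example(w, x, y, epsilon=0.1)

print("clean loss:      ", round(loss(w, x, y), 4))
print("adversarial loss:", round(loss(w, x_adv, y), 4))
```

During adversarial training, perturbed examples like `x_adv` are fed to the model alongside the originals, so it learns to give stable answers under small changes to its input.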

The Microsoft researchers hope to next explore how to enable DeBERTa to generalize to novel tasks composed of familiar subtasks or basic problem-solving skills, a concept known as compositional generalization. One path forward might be incorporating so-called compositional structures more explicitly, which could entail combining AI with symbolic reasoning -- in other words, manipulating symbols and expressions according to mathematical and logical rules.

"DeBERTa surpassing human performance on SuperGLUE marks an important milestone toward general AI," the Microsoft researchers wrote. "[But unlike DeBERTa,] humans are extremely good at leveraging the knowledge learned from different tasks to solve a new task with no or little task-specific demonstration."

New benchmarks

According to Bowman, no successor to SuperGLUE is forthcoming, at least not in the near term. But there's growing consensus within the AI research community that future benchmarks, particularly in the language domain, must take into account broader ethical, technical, and societal challenges if they're to be useful.

For example, a number of studies show that popular benchmarks do a poor job of estimating real-world AI performance. One recent report found that 60%-70% of answers given by natural language processing models were embedded somewhere in the benchmark training sets, indicating that the models were largely just memorizing answers. Another study -- a meta-analysis of over 3,000 AI papers -- found that metrics used to benchmark AI and machine learning models tended to be inconsistent, irregularly tracked, and not particularly informative.
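
A simple version of the overlap check behind findings like the first one can be sketched as follows. The sketch assumes exact, case-insensitive matching between answer strings and uses made-up data. It is a simplification for illustration, not the methodology of the report cited above.

```python
def answer_overlap(train_answers, test_answers):
    """Fraction of test answers that also appear in the training answers."""
    seen = {a.strip().lower() for a in train_answers}
    hits = sum(1 for a in test_answers if a.strip().lower() in seen)
    return hits / len(test_answers)

# Hypothetical answer lists, not taken from any real benchmark.
train = ["Paris", "1969", "Marie Curie", "blue whale"]
test = ["paris", "Mount Everest", "1969", "Ada Lovelace", "Marie Curie"]

overlap = answer_overlap(train, test)
print(f"{overlap:.0%} of test answers also appear in training data")
```

A high overlap means a model can score well by recalling answers it has already seen, which says little about how it will handle new questions.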

Part of the problem stems from the fact that language models like OpenAI's GPT-3, Google's T5 + Meena, and Microsoft's DeBERTa learn to write humanlike text by internalizing examples from the public web. Drawing on sources like ebooks, Wikipedia, and social media platforms like Reddit, they make inferences to complete sentences and even whole paragraphs.

As a result, language models often amplify the biases encoded in this public data; a portion of the training data is often sourced from communities with pervasive gender, race, and religious prejudices. AI research firm OpenAI notes that this can lead to placing words like "naughty" or "sucked" near female pronouns and "Islam" near words like "terrorism." Other studies, like one published by Intel, MIT, and Canadian AI initiative CIFAR researchers in April, have found high levels of stereotypical bias in some of the most popular models, including Google's BERT and XLNet, OpenAI's GPT-2, and Facebook's RoBERTa. This bias could be leveraged by malicious actors to foment discord by spreading misinformation, disinformation, and outright lies that "radicalize individuals into violent far-right extremist ideologies and behaviors," according to the Middlebury Institute of International Studies.

Most existing language benchmarks fail to capture this. Motivated by the findings in the two years since SuperGLUE's introduction, perhaps future ones might.