Researchers detail blind spots of large language models

Modern AI-powered language systems like OpenAI's GPT-3 can generate impressively fluent and grammatical text. But they aren't perfect. While these systems rarely make syntactic errors, they're prone to breaking semantic and narrative rules or struggling with repetition. For example, they might change the subject of a conversation without a segue or answer a question with an illogical statement.

To measure the extent to which systems suffer from these shortcomings, researchers at the Allen Institute for AI developed Scarecrow, a framework that gives developers a way to mark problems in AI-generated text. In an analysis spanning 13,000 annotations of 1,300 paragraphs from both AI systems and humans, they found that scaling up the models powering the systems helps mitigate some issues, but that others might require more involved fixes.

Categorizing model errors

The researchers applied their framework to OpenAI's GPT-2 and GPT-3, as well as Grover, a fake news generator and detector from the University of Washington. As the team explains in a paper, Scarecrow divides errors into 10 categories identified by combining expert analysis with crowdsourced annotation:

Grammar and usage: Missing words, extra words, and incorrect or out-of-order words.
Redundant: Repeated words or phrases, or ideas repeated using different words.
Off-prompt: A phrase or sentence unrelated -- or contradictory -- to a prompt given to a language generation system.
Self-contradiction: Text that contradicts another piece of text the system had previously written.
Incoherent: Text that doesn't fit into the above categories but still doesn't make sense.
Technical jargon: Jargon or specific words from an esoteric field.
Needs Google: A fact or figure that appears to be true but requires a Google search to confirm.
Bad math: Problems with basic math and converting fixed units and currencies.
Commonsense: Text that violates our basic understanding of the world.
Encyclopedic: Factually wrong text disproven by textbooks, Wikipedia entries, or encyclopedias.
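
To make the taxonomy concrete, here is a minimal Python sketch of how span-level annotations using these 10 categories might be stored and tallied. The class and function names (ErrorType, Annotation, errors_per_paragraph) and the character-offset fields are hypothetical, chosen for illustration; they are not Scarecrow's actual data format.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class ErrorType(Enum):
    """The 10 error categories listed in the article."""
    GRAMMAR_AND_USAGE = "Grammar and usage"
    REDUNDANT = "Redundant"
    OFF_PROMPT = "Off-prompt"
    SELF_CONTRADICTION = "Self-contradiction"
    INCOHERENT = "Incoherent"
    TECHNICAL_JARGON = "Technical jargon"
    NEEDS_GOOGLE = "Needs Google"
    BAD_MATH = "Bad math"
    COMMONSENSE = "Commonsense"
    ENCYCLOPEDIC = "Encyclopedic"


@dataclass(frozen=True)
class Annotation:
    """One marked span of generated text (hypothetical schema)."""
    start: int  # character offset where the marked span begins
    end: int    # character offset where the marked span ends
    error_type: ErrorType


def errors_per_paragraph(annotations, num_paragraphs):
    """Average number of marked spans per paragraph, for every category."""
    counts = Counter(a.error_type for a in annotations)
    return {t: counts[t] / num_paragraphs for t in ErrorType}


# Made-up annotations over two paragraphs, purely for illustration.
annotations = [
    Annotation(0, 12, ErrorType.REDUNDANT),
    Annotation(30, 41, ErrorType.BAD_MATH),
    Annotation(5, 20, ErrorType.REDUNDANT),
]
rates = errors_per_paragraph(annotations, num_paragraphs=2)
print(rates[ErrorType.REDUNDANT])  # 1.0
print(rates[ErrorType.BAD_MATH])   # 0.5
```

Comparing per-category rates like these across models of different sizes is one plausible way to look for the scaling trends the researchers describe.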

According to the researchers, certain errors, like Encyclopedic, Commonsense, and Incoherent errors, become less frequent in models trained on data from particular domains, like news, and in models with more parameters. (In machine learning, parameters are the parts of a model learned from historical training data, and parameter counts generally correlate with linguistic sophistication.) But the researchers say the benefits of parameter scaling seemingly plateau for Off-Prompt, Bad Math, and Grammar and Usage errors.

"These three error categories see a model plateau in error reduction when scaling to GPT-3. Of these error types, humans still commit fewer Off-Prompt and Grammar and Usage errors, but Bad Math appears saturated for our [study]," the researchers wrote.

Self-Contradiction and Redundant errors exhibit more complex scaling behavior, increasing for medium- and large-scale models, depending on interactions with other error types and how the errors are counted. Sampling from a larger set of words makes the models more prone to changing topics but less likely to repeat themselves, and vice versa.
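
The idea of "sampling from a larger set of words" can be illustrated with a small sketch. The example below uses top-k sampling, one common way to limit the candidate set at each step; the article does not say which decoding method the researchers varied, so the technique, the function name top_k_sample, and the toy probabilities are all assumptions made for illustration.

```python
import random


def top_k_sample(probs, k, rng):
    """Sample the next word from only the k most likely candidates.

    probs maps each candidate word to its probability. A small k keeps
    output predictable; a large k lets unlikely words through.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    top = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    words = [word for word, _ in top]
    weights = [p for _, p in top]
    # random.choices accepts relative weights, so no renormalization is needed.
    return rng.choices(words, weights=weights, k=1)[0]


rng = random.Random(0)  # fixed seed so runs are repeatable
next_word_probs = {"the": 0.5, "a": 0.3, "cat": 0.15, "quantum": 0.05}

narrow = [top_k_sample(next_word_probs, 1, rng) for _ in range(5)]
wide = [top_k_sample(next_word_probs, 4, rng) for _ in range(5)]
print(narrow)  # with k=1 the most likely word is always chosen
print(wide)    # with k=4 any of the four words can appear
```

With k=1 the sampler picks the same word every time, which is the kind of setting that encourages repetition. Widening the candidate set admits rarer words, which reduces repetition but makes a drift off topic more likely.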

"We posit the reason is that GPT-2 generations [in particular] are so incoherent and off-prompt that there is little opportunity for relevant, comprehensible points to be made and then reversed," the researchers noted in the paper. "We [also] observe GPT-3 will seem stuck on a particular topic, elaborating on and rephrasing similar ideas more times than a human writer would."

The researchers hope to spur further study of natural language generation at scale, particularly of ways that errors in language models might be fixed automatically. "This paper focuses on open-ended generation, but a natural extension of this method would be to [assess] constrained generation tasks, such as machine translation," they wrote. "Especially if considering a novel task setting, new error types may [also] prove useful."
