Researchers are working toward more transparent language models

The most sophisticated AI language models, like OpenAI's GPT-3, can perform tasks from generating code to drafting marketing copy. But many of the underlying mechanisms remain opaque, making these models prone to unpredictable -- and sometimes toxic -- behavior. As recent research has shown, even careful calibration can't always prevent language models from making sexist associations or endorsing conspiracies.

Newly proposed explainability techniques promise to make language models more transparent than before. While they aren't silver bullets, they could be the building blocks for less problematic models -- or at the very least models that can explain their reasoning.

Citing sources

A language model learns how likely a word is to occur, given the words around it, from sets of example text. Simpler models look at the context of a short sequence of words, whereas larger models work at the level of phrases, sentences, or paragraphs. Most commonly, language models deal with words -- sometimes referred to as tokens.
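To make the idea of learning word probabilities from example text concrete, here is a minimal sketch of the simplest kind of model described above: one that looks only at the single preceding word (a "bigram" model). The toy corpus and the function names `train_bigram_model` and `next_token_probability` are assumptions made for illustration; large models like GPT-3 use neural networks rather than raw counts.

```python
from collections import Counter, defaultdict


def train_bigram_model(corpus):
    """Count how often each token follows each preceding token."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        # Naive whitespace tokenization; real systems use subword tokenizers.
        tokens = sentence.lower().split()
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    return counts


def next_token_probability(counts, prev, nxt):
    """Estimate P(nxt | prev) as a relative frequency from the counts."""
    followers = counts.get(prev, Counter())
    total = sum(followers.values())
    if total == 0:
        return 0.0  # The preceding token was never seen in training.
    return followers[nxt] / total


# Hypothetical toy corpus, purely illustrative.
corpus = [
    "the cat sat on the mat",
    "the cat ate the fish",
    "the dog sat on the rug",
]
model = train_bigram_model(corpus)

print(next_token_probability(model, "the", "cat"))  # 2 of the 6 words after "the"
print(next_token_probability(model, "cat", "sat"))  # 1 of the 2 words after "cat"
```

Because the estimates come straight from the counts, the model can only reflect whatever patterns its example text contains.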

Indeed, the largest language models learn to write humanlike text by internalizing billions of examples from the public web. Drawing on sources like ebooks, Wikipedia, and social media platforms like Reddit, they make inferences in near-real-time.

Many studies demonstrate the shortcomings of this training approach. Even GPT-3 struggles with nuanced topics like morality, history, and law; language models writ large have been shown to exhibit prejudices along racial, ethnic, religious, and gender lines. Moreover, language models don't understand language the way humans do. Because they typically pick up on only a few key words in a sentence, they can't tell when words in a sentence are jumbled up -- even when the new order changes the meaning.

A recent paper coauthored by researchers at Google outlines a potential, partial solution: a framework called Attributable to Identified Sources. It's designed to evaluate the sources (e.g., Reddit and Wikipedia) from which a language model might pull when, for example, answering a particular question. The researchers say that the framework can be used to assess whether statements from a model were derived from a specific source. With it, users can determine which source the model is attributing its statements to and see the evidence for its claims.

"With recent improvements in natural language generation ... models for various applications, it has become imperative to have the means to identify and evaluate whether [model] output is only sharing verifiable information about the external world," the researcher wrote in a paper. "[Our framework] could serve as a common framework for measuring whether model-generated statements are supported by underlying sources."

The coauthors of another study take a different tack to language model explainability. They propose leveraging "prototype" models -- Proto-Trex -- incorporated into a language model's architecture that can explain the reasoning process behind the model's decisions. While the interpretability comes with a trade-off in accuracy, the researchers say that the results are "promising" in providing helpful explanations that shed light on language models' decision-making.

In the absence of a prototype model, researchers at École Polytechnique Fédérale de Lausanne (EPFL) generated "knowledge graph" extracts to compare variations of language models. (A knowledge graph represents a network of objects, events, situations, or concepts and illustrates the relationships between them.) The framework can identify the strengths of each model, the researchers claim, allowing users to compare models, diagnose their strengths and weaknesses, and identify new datasets to improve their performance.
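As a rough illustration of the data structure involved, a knowledge graph can be stored as a set of (subject, relation, object) triples, and two graphs can then be compared with ordinary set operations. The sketch below is not the EPFL researchers' pipeline; the `KnowledgeGraph` class, the `compare` helper, and the example facts are all hypothetical and chosen only to show the idea.

```python
from collections import defaultdict


class KnowledgeGraph:
    """A tiny knowledge graph stored as subject -> {(relation, object)}."""

    def __init__(self):
        self._edges = defaultdict(set)

    def add(self, subject, relation, obj):
        self._edges[subject].add((relation, obj))

    def triples(self):
        """Return every fact as a (subject, relation, object) tuple."""
        return {
            (subject, relation, obj)
            for subject, edges in self._edges.items()
            for relation, obj in edges
        }


def compare(graph_a, graph_b):
    """Split the facts into those both graphs hold and those unique to each."""
    a, b = graph_a.triples(), graph_b.triples()
    return {"shared": a & b, "only_a": a - b, "only_b": b - a}


# Hypothetical facts extracted from two different models (illustrative only).
model_a = KnowledgeGraph()
model_a.add("Paris", "capital_of", "France")
model_a.add("Seine", "flows_through", "Paris")

model_b = KnowledgeGraph()
model_b.add("Paris", "capital_of", "France")
model_b.add("Louvre", "located_in", "Paris")

report = compare(model_a, model_b)
print(report["shared"])
print(report["only_a"])
print(report["only_b"])
```

Facts that appear in only one graph hint at what one model has picked up that the other has not, which is the kind of comparison beyond accuracy that the researchers describe.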

"These generated knowledge graphs are a large step towards addressing the research questions: How well does my language model perform in comparison to another (using metrics other than accuracy)? What are the linguistic strengths of my language model? What kind of data should I train my model on to improve it further?" the researchers wrote. "Our pipeline aims to become a diagnostic benchmark for language models, providing an alternate approach for AI practitioners to identify language model strengths and weaknesses during the model training process itself."

Limitations to interpretability

Explainability in large language models is by no means a solved problem. As one study found, there's an "interpretability illusion" that arises when analyzing a popular language model architecture called bidirectional encoder representations from transformers (BERT). Individual components of the model may incorrectly appear to represent a single, simple concept, when in fact they're representing something far more complex.

There's another, more existential pitfall in model explainability: over-trust. A 2018 Microsoft study found that transparent models can make it harder for non-experts to detect and correct a model's mistakes. More recent work suggests that interpretability tools like Google's Language Interpretability Tool, particularly those that give an overview of a model via data plots and charts, can lead to incorrect assumptions about the dataset and models, even when the output is manipulated to show explanations that make no sense.

It's what's known as the automation bias -- the propensity for people to favor suggestions from automated decision-making systems. Combating it isn't easy, but researchers like Georgia Institute of Technology's Upol Ehsan believe that explanations given by "glassbox" AI systems, if customized to people's level of expertise, would go a long way.

"The goal of human-centered explainable AI is not just to make the user agree to what the AI is saying. It is also to provoke reflection," Ehsan said, speaking to MIT Tech Review.