Allen Institute researchers find pervasive toxicity in popular language models

Researchers at the Allen Institute for AI have created a data set -- RealToxicityPrompts -- that attempts to elicit racist, sexist, or otherwise toxic responses from AI language models, as a way of measuring the models' preferences for these responses. In experiments, they claim to have found that no current machine learning technique sufficiently protects against toxic outputs, underlining the need for better training sets and model architectures.

It's well-established that models amplify the biases in data on which they were trained. That's problematic in the language domain, because a portion of the data is often sourced from communities with pervasive gender, race, and religious prejudices. AI research firm OpenAI notes that this can lead to placing words like "naughty" or "sucked" near female pronouns and "Islam" near words like "terrorism." Other studies, like one published by Intel, MIT, and Canadian AI initiative CIFAR researchers in April, have found high levels of stereotypical bias from some of the most popular models, including Google's BERT and XLNet, OpenAI's GPT-2, and Facebook's RoBERTa.

The Allen Institute researchers designed RealToxicityPrompts to measure the risk of "toxic degeneration" by pretrained language models, that is, models trained on data sets containing thousands to billions of documents. They compiled a list of 100,000 naturally occurring prompts extracted from a large corpus of English Reddit text (the open source Open-WebText Corpus) and paired them with toxicity scores from Google's Perspective API, which uses machine learning models to detect the potential toxicity of a comment.

The coauthors evaluated five language models using RealToxicityPrompts, specifically three models from OpenAI (GPT-1, GPT-2, and GPT-3) and two models from Salesforce (CTRL and CTRL-Wiki). They found that while toxic prompts -- prompts offensive or stereotypically biased on their face -- were 70% or more likely to yield toxic content from the language models, even non-toxic prompts resulted in offensive responses. The results show that all models were 49% or more likely to answer non-toxic content with toxic responses, even models like CTRL-Wiki that were only trained on Wikipedia data.
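To make the percentages above concrete, here is a minimal Python sketch of one way such a rate could be computed: the share of prompts for which at least one model generation is scored as toxic. The article does not spell out the exact metric, so the function name `toxic_generation_rate`, the 0.5 score threshold, and the sample scores are all illustrative assumptions rather than the researchers' method or data.

```python
from typing import Dict, List


def toxic_generation_rate(
    scores_by_prompt: Dict[str, List[float]],
    threshold: float = 0.5,  # assumed cutoff; not stated in the article
) -> float:
    """Return the fraction of prompts with at least one generation
    whose toxicity score is at or above the threshold."""
    if not scores_by_prompt:
        raise ValueError("need at least one prompt")
    flagged = sum(
        1
        for scores in scores_by_prompt.values()
        if any(score >= threshold for score in scores)
    )
    return flagged / len(scores_by_prompt)


# Made-up toxicity scores (0.0 to 1.0) for the generations of four prompts.
example_scores = {
    "prompt_a": [0.10, 0.70],
    "prompt_b": [0.20, 0.30],
    "prompt_c": [0.90],
    "prompt_d": [0.05, 0.40],
}

rate = toxic_generation_rate(example_scores)
print(f"{rate:.0%} of prompts produced at least one toxic generation")
```

In practice the scores would come from a toxicity classifier such as the Perspective API mentioned above; the hard-coded dictionary stands in for those calls so the sketch runs on its own.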

To uncover the potential reasons for this, the researchers investigated the corpora used to pretrain several of the language models: OpenAI-WT (GPT-2's training data) and OWTC (an open source fork of OpenAI-WT). OWTC contains text from Reddit posts with a karma of 3 or higher and 38GB of English documents, including news articles. OpenAI-WT -- which has a 29% overlap with OWTC, such that at least 2.3 million documents in OpenAI-WT also appear in OWTC -- contains about 8 million documents filtered using a blocklist of sexually explicit and otherwise offensive subreddits.

The researchers found that OWTC and OpenAI-WT contain "non-negligible" amounts of toxicity as identified by the Perspective API. About 2.1% of documents in OWTC were offensive, compared with 4.3% in OpenAI-WT, roughly twice OWTC's share despite the blocklist. Unreliable news sites were another major source of toxicity in the data sets, as were posts from banned or quarantined subreddits. In fact, 63,000 documents in OpenAI-WT and OWTC came from links shared in problematic Reddit communities; GPT-2 was pretrained on at least 40,000 documents from the quarantined /r/The_Donald and 4,000 documents from the banned /r/WhiteRights.
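The corpus figures reported above can be sanity-checked with a few lines of arithmetic. The short Python sketch below uses only numbers quoted in this article (the variable names are ours) and treats "about 8 million" and "at least 2.3 million" as exact values for the purpose of the check.

```python
# Figures as quoted in the article, treated as exact for this check.
openai_wt_docs = 8_000_000   # "about 8 million documents" in OpenAI-WT
overlap_docs = 2_300_000     # "at least 2.3 million" also appear in OWTC

# Share of OpenAI-WT documents that also appear in OWTC.
overlap_share = overlap_docs / openai_wt_docs

# Reported percentages of documents flagged as toxic in each corpus.
owtc_toxic_pct = 2.1
openai_wt_toxic_pct = 4.3

# How many times higher the toxic share is in OpenAI-WT than in OWTC.
toxicity_ratio = openai_wt_toxic_pct / owtc_toxic_pct

print(f"overlap share: {overlap_share:.1%}")
print(f"toxicity ratio: {toxicity_ratio:.2f}x")
```

The overlap works out to just under 29% and the toxicity ratio to slightly more than 2, which matches the rounded figures in the text.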

"Overall, our investigations demonstrate that toxicity is a prevalent issue in both neural language generation and web text corpora," the coauthors wrote in a paper describing their work. "Although they show some reduction in toxicity, steering methods do not fully protect neural models from toxic degeneration. Additionally, the corpora that language models are pretrained on contain non-negligible amounts of toxic, abusive, and untrustworthy content."
