Researchers release dataset to expose racial, religious, and gender biases in language models

Natural language models are the building blocks of apps including machine translators, text summarizers, chatbots, and writing assistants. But there's growing evidence showing that these models risk reinforcing undesirable stereotypes, mostly because a portion of the training data is commonly sourced from communities with gender, race, and religious prejudices. For example, OpenAI's GPT-3 places words like "naughty" or "sucked" near female pronouns and "Islam" near words like "terrorism."

A new study from researchers affiliated with Amazon and the University of California, Santa Barbara aims to shed light specifically on biases in open-ended English natural language generation. (In this context, "bias" refers to the tendency of a language model to generate text perceived as being negative, unfair, prejudiced, or stereotypical against an idea or a group of people with common characteristics.) The researchers created what they claim is the largest benchmark dataset of its kind, containing 23,679 prompts across 5 domains and 43 subgroups, all extracted from Wikipedia articles. To measure biases from multiple angles, they also introduce new metrics, including "psycholinguistic norms," "toxicity," and "gender polarity."
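The article does not spell out how a metric like "gender polarity" is computed. As a rough illustration only, one simple approach is to count gendered words in each generated text and label the text by whichever count is larger. The sketch below assumes that approach; the word lists and the function names `gender_polarity` and `polarity_shares` are hypothetical and are not taken from the study.

```python
import re

# Illustrative word lists (an assumption); the study's actual lexicons
# are not given in this article.
MALE_TOKENS = {"he", "him", "his", "himself", "man", "men"}
FEMALE_TOKENS = {"she", "her", "hers", "herself", "woman", "women"}


def gender_polarity(text):
    """Label a text 'male', 'female', or 'neutral' by gendered-token counts."""
    tokens = re.findall(r"[a-z']+", text.lower())
    male = sum(t in MALE_TOKENS for t in tokens)
    female = sum(t in FEMALE_TOKENS for t in tokens)
    if male > female:
        return "male"
    if female > male:
        return "female"
    return "neutral"


def polarity_shares(texts):
    """Return the share of texts that received each label."""
    counts = {"male": 0, "female": 0, "neutral": 0}
    for text in texts:
        counts[gender_polarity(text)] += 1
    total = len(texts) or 1  # avoid dividing by zero on an empty list
    return {label: n / total for label, n in counts.items()}
```

Comparing the shares for text generated from, say, nursing-related prompts against the shares for the original Wikipedia sentences would give a crude sense of whether a model skews more heavily toward one gender than its source material does.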

In experiments, the researchers tested three common language models: GPT-2 (GPT-3's predecessor), Google's BERT, and Salesforce's CTRL. The results show that, in general, these models exhibit larger social biases than the baseline Wikipedia text, especially toward historically disadvantaged groups of people.

For example, the three language models strongly associated the profession of "nursing" with women and generated a higher proportion of texts with negative conceptions about men. While text from the models about men contained emotions like "anger," "sadness," "fear," and "disgust," a larger number of texts about women had positive emotions like "joy" and "dominance."

With regard to religion, GPT-2, BERT, and CTRL expressed the most negative sentiments about atheism, followed by Islam. A higher percentage of texts generated with Islam prompts were labeled as conveying negative emotions, while Christianity prompts tended to yield more cheerful sentiment. In terms of toxicity, only prompts with Islam, Christianity, and atheism resulted in toxic texts, among which atheism had the largest proportion.

Across ethnicities and races, toxicity from the models was outsize for African Americans. In fact, the share of texts with negative regard for African American groups was at least marginally larger in five out of six models, indicating a consistent bias against African Americans in multiple key metrics.

The coauthors say that the results highlight the importance of studying the behavior of language generation models before they're deployed into a production environment. Failure to do so, they warn, could at the least propagate negative outcomes and experiences for end users.

"Our intuition is that while carefully handpicked language model triggers and choices of language model generations can show some interesting results, they could misrepresent the level of bias that a language model produces when presented with more natural prompts. Furthermore, language model generations in such a contrived setting could reinforce the type of biases that it was triggered to generate while failing to uncover other critical biases that need to be exposed," the researchers wrote.

"Given that a large number of state-of-the-art models on natural language processing tasks are powered by these language generation models, it's of critical importance to properly discover and quantify any existing biases in these models and prevent them from propagating as unfair outcomes and negative experiences to the end users of the downstream applications."
