AI researchers from MIT, Intel, and Canadian AI initiative CIFAR have found high levels of stereotypical bias in some of the most popular pretrained language models, including Google’s BERT and XLNet, OpenAI’s GPT-2, and Facebook’s RoBERTa. The analysis was performed as part of the launch of StereoSet, a data set, challenge, leaderboard, and set of metrics for evaluating racism, sexism, and stereotypes related to religion and profession in pretrained language models.

The authors believe their work is the first large-scale study to show stereotypes in pretrained language models beyond gender bias. BERT is widely regarded as one of the top-performing language models of recent years, while GPT-2, RoBERTa, and XLNet each claimed top spots on the GLUE leaderboard last year. Half of today’s GLUE leaderboard top 10, including RoBERTa, are variations of BERT.

The team evaluated pretrained language models based on both language modeling ability and stereotypical bias. A small version of OpenAI’s GPT-2 tops the StereoSet leaderboard in early testing. Several examples of how each model performs in each area of bias can be found on the StereoSet website.

Above: Examples of StereoSet intersentence pairings for bias analysis from an ensemble of language models evaluated in the work

Image Credit: StereoSet

The StereoSet data set includes about 17,000 test instances for carrying out Context Association Tests (CATs), which measure both a model’s language modeling ability and its stereotypical bias. An idealized CAT score, or ICAT, combines the language modeling score and the stereotype score into a single number.
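
How the two scores combine can be sketched in a few lines. The article does not give the formula; the sketch below assumes the definition from the StereoSet paper, icat = lms * min(ss, 100 - ss) / 50, where lms is the language modeling score and ss is the stereotype score, both percentages. The function name `icat` is chosen here for illustration.

```python
def icat(lms: float, ss: float) -> float:
    """Idealized CAT score, assuming icat = lms * min(ss, 100 - ss) / 50.

    lms: language modeling score, a percentage in [0, 100]
    ss:  stereotype score, a percentage in [0, 100]; 50 means the model
         prefers stereotypes and anti-stereotypes equally often
    """
    if not (0.0 <= lms <= 100.0 and 0.0 <= ss <= 100.0):
        raise ValueError("lms and ss must be percentages in [0, 100]")
    # min(ss, 100 - ss) penalizes drift from 50 in either direction.
    return lms * min(ss, 100.0 - ss) / 50.0


print(icat(100.0, 50.0))  # perfect language modeling, no stereotype preference
print(icat(90.0, 60.0))   # strong language model that leans toward stereotypes
```

Under this assumed definition, a model with perfect language modeling and no preference either way scores 100, and a model that always picks the stereotype (or always the anti-stereotype) scores 0 regardless of its language modeling ability.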

“We show that current pretrained language model exhibit strong stereotypical biases, and that the best model is 27 ICAT points behind the idealistic language model,” the paper reads. “We find that the GPT-2 family of models exhibit relatively more idealistic behavior than other pretrained models like BERT, RoBERTa, and XLNet.”

Above: Examples of StereoSet intrasentence pairings for bias analysis from an ensemble of language models considered in the work

Image Credit: StereoSet

Contributors to the work published on preprint repository arXiv include MIT student Moin Nadeem, Intel AI for Social Good lead Anna Bethke, and McGill University associate professor and CIFAR Facebook AI chair Siva Reddy. Pretrained models are known for capturing stereotypical bias because they’re trained on large data sets of real-world data.

Researchers believe GPT-2 may strike a better balance than other models because it’s built using data from Reddit. Another challenge introduced last week, aimed at training language models to give better advice to humans, also relies on subreddit communities for its training data.

“Since Reddit has several subreddits related to target terms in StereoSet (e.g., relationships, religion), GPT2 is likely to be exposed to correct contextual associations,” the paper reads. “Also, since Reddit is moderated in these niche subreddits (ie./r/feminism), it could be the case that both stereotypical and anti-stereotypical associations are learned.”

Researchers were surprised to find no correlation between the size of the data set used to train a model and its idealized CAT (ICAT) score.

“As the language model becomes stronger, so its stereotypical bias (SS) too. This is unfortunate and perhaps unavoidable as long as we rely on real world distribution of corpora to train language models since these corpora are likely to reflect stereotypes (unless carefully selected),” the paper reads. “This could be due to the difference in architectures and the type of corpora these models are trained on.”

To examine bias, StereoSet runs models through sentence-level, or intrasentence, fill-in-the-blank tests, as well as dialog-style, or intersentence, tests. In both, a model is asked to choose among three candidate associations for a given context: one stereotypical, one anti-stereotypical, and one unrelated.
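
A rough sketch of how such tests could be turned into scores follows. It assumes each test instance has already been reduced to three model scores (for example, log-likelihoods) for the stereotypical, anti-stereotypical, and unrelated options. The class `CatInstance`, the function `cat_scores`, and the sample numbers are hypothetical, and the counting rules are one plausible reading of the paper's language modeling score (lms) and stereotype score (ss), not the official evaluation code.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass
class CatInstance:
    """Model scores for the three options of one test instance (hypothetical)."""
    stereotype: float
    anti_stereotype: float
    unrelated: float


def cat_scores(instances: Iterable[CatInstance]) -> Tuple[float, float]:
    """Return (lms, ss) as percentages.

    lms: share of meaningful-vs-unrelated comparisons the meaningful option wins
         (two comparisons per instance).
    ss:  share of instances where the stereotype outscores the anti-stereotype.
    """
    items = list(instances)
    if not items:
        raise ValueError("need at least one instance")
    meaningful_wins = 0
    stereotype_wins = 0
    for x in items:
        meaningful_wins += int(x.stereotype > x.unrelated)
        meaningful_wins += int(x.anti_stereotype > x.unrelated)
        stereotype_wins += int(x.stereotype > x.anti_stereotype)
    lms = 100.0 * meaningful_wins / (2 * len(items))
    ss = 100.0 * stereotype_wins / len(items)
    return lms, ss


# Made-up log-likelihoods, for illustration only.
sample = [
    CatInstance(-2.1, -2.5, -6.0),
    CatInstance(-3.0, -2.2, -5.5),
    CatInstance(-1.5, -1.9, -1.2),
    CatInstance(-2.8, -2.0, -4.0),
]
lms, ss = cat_scores(sample)
print(f"lms={lms:.1f} ss={ss:.1f}")
```

In this reading, a good language model drives lms toward 100 by ranking both meaningful options above the unrelated one, while an unbiased model keeps ss near 50.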

The StereoSet data set of associated terms and stereotypes was assembled by paid Mechanical Turk workers in the United States. Researchers say this approach may come with limitations because the majority of Mechanical Turk workers are under the age of 50.

In the past, a number of NLP researchers have analyzed bias through word embeddings, which can reveal a model’s learned associations, for example, that a doctor is a man and a nurse is a woman. The CAT was inspired by previous work in bias evaluation, particularly the Word Embedding Association Test (WEAT).
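
The kind of embedding-based check described above can be illustrated with a small sketch. It computes only the per-word association term used in WEAT, the mean cosine similarity of a word to one attribute set minus its mean cosine similarity to another, and omits WEAT's test statistic and effect size. The two-dimensional vectors are toy values invented for illustration, not real embeddings, and the function names are hypothetical.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def association(word: Vector, attrs_a: Sequence[Vector], attrs_b: Sequence[Vector]) -> float:
    """Mean similarity to attribute set A minus mean similarity to attribute set B."""
    mean_a = sum(cosine(word, a) for a in attrs_a) / len(attrs_a)
    mean_b = sum(cosine(word, b) for b in attrs_b) / len(attrs_b)
    return mean_a - mean_b


# Toy vectors, made up for illustration only.
doctor = (0.9, 0.2)
nurse = (0.2, 0.9)
male_terms = [(1.0, 0.0)]
female_terms = [(0.0, 1.0)]

# Positive means closer to male_terms, negative means closer to female_terms.
print(association(doctor, male_terms, female_terms))
print(association(nurse, male_terms, female_terms))
```

With real embeddings, a consistently positive value for some professions and a negative value for others would be the sort of gendered association the earlier studies reported.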

In related work last week, a group of researchers from 30 organizations, including Google and OpenAI, recommended that AI companies offer bias bounties and create a third-party bias and safety audit market, among other ways to turn AI ethics principles into practice.

The news follows a study last month that found major automated speech recognition systems recognized white voices more accurately than black voices. Last year, researchers found OpenAI’s GPT-2 generates different responses when prompted with race-related language.