How bias creeps into the AI designed to detect toxicity

In 2017, Google's Counter Abuse Technology team and Jigsaw, the organization working under Google parent company Alphabet to tackle cyberbullying and disinformation, released an AI-powered API for content moderation called Perspective. Perspective's goal is to "identify toxic comments that can undermine a civil exchange of ideas." It scores new comments from zero to 100 based on how similar they are to comments previously identified as toxic, where toxicity is defined as how likely a comment is to make someone leave a conversation.

Jigsaw claims its AI can immediately generate an assessment of a phrase's toxicity more accurately than any keyword blacklist and faster than any human moderator. But studies show that technologies similar to Jigsaw's still struggle to overcome major challenges, including biases against specific subsets of users.

For example, a team at Penn State recently found that posts on social media about people with disabilities could be flagged as more negative or toxic by commonly used public sentiment and toxicity detection models. After training several of these models to complete an open benchmark from Jigsaw, the team observed that the models learned to associate "negative-sentiment" words like "drugs," "homelessness," "addiction," and "gun violence" with disability -- and the words "blind," "autistic," "deaf," and "mentally handicapped" with a negative sentiment.

"The biggest issue is that they are public models that are easily used to classify texts based on sentiment," Pranav Narayanan Venkit and Shomir Wilson, the coauthors of the paper, told VentureBeat via email. Narayanan Venkit is a Ph.D. student in informatics at Penn State and Wilson is an assistant professor in Penn State's College of Information Sciences. "The results are important as they show how machine learning solutions are not perfect and how we need to be more responsible for the technology we create. Such outright discrimination is both wrong and detrimental to the community as it does not represent such communities or languages accurately."

Emergent biases

Studies show that language models amplify the biases in data on which they were trained. For instance, Codex, a code-generating model developed by OpenAI, can be prompted to write "terrorist" when fed the word "Islam." Another large language model from Cohere tends to associate men and women with stereotypically "male" and "female" occupations, like "male scientist" and "female housekeeper."

That's because language models are essentially a probability distribution over words or sequences of words. In practice, a model gives the probability of a word sequence being "valid" -- i.e., resembling how people write. Some language models are trained on hundreds of gigabytes of text from occasionally toxic websites and so learn to correlate certain genders, races, ethnicities, and religions with "negative" words, because the negative words are overrepresented in the texts.
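To make the idea concrete, here is a minimal sketch of a count-based language model that conditions each word on the two words before it. It is far simpler than the neural models discussed in this article, and the corpus, the group names (`group_a`, `group_b`), and the function names are all hypothetical. It only illustrates how a skew in the training text turns into a skew in the model's probabilities.

```python
from collections import Counter, defaultdict


def train_trigram(corpus):
    """Count how often each word follows each pair of preceding words."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        tokens = ["<s>", "<s>"] + sentence.lower().split() + ["</s>"]
        for i in range(2, len(tokens)):
            context = (tokens[i - 2], tokens[i - 1])
            counts[context][tokens[i]] += 1
    return counts


def next_word_prob(counts, context, word):
    """Estimate P(word | context) from raw counts (no smoothing)."""
    total = sum(counts[context].values())
    return counts[context][word] / total if total else 0.0


# Hypothetical toy corpus in which one group is described negatively
# more often than the other.
corpus = (
    ["group_a is dangerous"] * 3
    + ["group_a is kind"]
    + ["group_b is kind"] * 3
    + ["group_b is dangerous"]
)

counts = train_trigram(corpus)
print(next_word_prob(counts, ("group_a", "is"), "dangerous"))  # 0.75
print(next_word_prob(counts, ("group_b", "is"), "dangerous"))  # 0.25
```

The model has no notion of what either group is. It simply reproduces the imbalance in the text it was given.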

The model powering Perspective was built to classify rather than generate text. But it learns the same associations -- and therefore biases -- as generative models. In a study published by researchers at the University of Oxford, the Alan Turing Institute, Utrecht University, and the University of Sheffield, an older version of Perspective struggled to recognize hate speech that used "reclaimed" slurs like "queer" and spelling variations like missing characters. An earlier University of Washington paper, published in 2019, found that Perspective was more likely to label tweets from Black users as offensive than tweets from white users.

The problem extends beyond Jigsaw and Perspective. In 2019, engineers at Meta (formerly Facebook) reportedly discovered that a moderation algorithm at Meta-owned Instagram was 50% more likely to ban Black users than white users. More recent reporting revealed that, at one point, the hate speech detection systems Meta used on Facebook detected comments denigrating white people more aggressively than attacks on other demographic groups.

For its part, Jigsaw acknowledges that Perspective doesn't always work well in certain areas. But the company stresses that it's intent on reducing false positives for "the high toxicity thresholds that most Perspective users employ." Perspective users can adjust the confidence level that the model must reach to deem a comment "toxic." Jigsaw says that most users -- which include The New York Times -- set the threshold at .7-.9 (70% to 90% confident, where .5 [50%] equates to a coin flip).
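As a rough illustration of how such a threshold works, the sketch below applies a cutoff to scores that are assumed to have been obtained already. It does not call the Perspective API. The function names, the example comments, and their scores are hypothetical, and the default of 0.8 is just a value inside the .7-.9 range mentioned above.

```python
def is_toxic(score, threshold=0.8):
    """Return True when a toxicity score in [0, 1] meets the threshold."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be between 0 and 1")
    return score >= threshold


def flag_comments(scored_comments, threshold=0.8):
    """Keep only the comments whose score meets the threshold."""
    return [text for text, score in scored_comments if is_toxic(score, threshold)]


# Hypothetical (comment, score) pairs; the scores are made up for illustration.
scored = [("comment A", 0.95), ("comment B", 0.75), ("comment C", 0.40)]

print(flag_comments(scored, threshold=0.7))  # ['comment A', 'comment B']
print(flag_comments(scored, threshold=0.9))  # ['comment A']
```

Raising the threshold flags fewer comments, which reduces false positives at the cost of letting more borderline comments through.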

"Perspective scores already represent probabilities, so there is some confidence information included in the score (low or high toxicity scores can be viewed as high confidence, while mid-range score indicate low confidence)," Jigsaw conversation AI software engineer Lucy Vasserman told VentureBeat via email. "We're working on clarifying this concept of uncertainty further."

Measuring uncertainty and detecting toxicity

A silver bullet to the problem of biased toxicity detection models remains predictably elusive. But the coauthors of a new study, which has yet to be peer-reviewed, explore a technique they claim could make it easier to detect -- and remove -- problematic word associations that models pick up from data.

In the study, researchers from Meta and the University of North Carolina at Chapel Hill propose what they call a "belief graph," an interface with language models that shows the relationships between a model's "beliefs" (e.g., "A viper is a vertebrate," "A viper has a brain"). The graph is editable, allowing users to "delete" individual beliefs they determine to be toxic, for example, or untrue.

"You could have a model that gives a toxic answer to a question, or rates a toxic statement as 'true,' and you don't want it to do that. So you go in and make a targeted update to that model's belief and change what it says in response to that input," Peter Hase, a Ph.D. student at UNC Chapel Hill and a coauthor of the paper, told VentureBeat via email. "If there are other beliefs that are logically entailed by the new model prediction (as opposed to the toxic prediction), the model is consistent and believes those things too. You don't accidentally change other independent beliefs about the things that the toxic belief involves, whether they're people groups, moral attitudes, or just a specific topic."

Hase believes that this technique could be applied to the kinds of models that are already a part of widely used products like GPT-3, but that it might not be practical because the methods aren't perfect yet. He points to complementary work coming out of Stanford that focuses on scaling the update methods to work with models of bigger sizes, which are likely to be used in more natural language applications in the future.

"Part of our vision in the paper is to make an interface with language models that shows people what they believe and why," Hase added. "We wanted to visualize model beliefs and the connections between them, so people could see when changing one belief changes others, [plus] other interesting things like whether the connections between beliefs are logically consistent."

Context-sensitive models

Another new work from researchers at Imperial College London pitches the idea of "contextual toxicity detection." The researchers argue that, compared with models that don't take certain semantics into account, their models can achieve a lower miss rate (i.e., miss fewer comments that seem harmless on their own but should be considered toxic in context) as well as a lower false alarm rate (i.e., flag fewer comments that contain toxic words but aren't toxic when interpreted in context).
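For readers unfamiliar with the two metrics, the following sketch computes them from binary labels, using the standard definitions (miss rate as the share of truly toxic comments that were not flagged, false alarm rate as the share of non-toxic comments that were flagged). The labels and predictions are hypothetical and are not taken from the Imperial College London paper.

```python
def miss_and_false_alarm_rates(y_true, y_pred):
    """Return (miss rate, false alarm rate) for binary labels, 1 = toxic."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    miss_rate = fn / (fn + tp) if (fn + tp) else 0.0
    false_alarm_rate = fp / (fp + tn) if (fp + tn) else 0.0
    return miss_rate, false_alarm_rate


# Hypothetical ground truth and model output for eight comments.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0]

print(miss_and_false_alarm_rates(y_true, y_pred))  # (0.5, 0.25)
```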

"We are proposing prediction models that take previous comments and the root post into account in a structured way, representing them separately from the comment to moderate, [and] preserving the order in which the comments were written," coauthor and Imperial College London natural language professor Lucia Specia told VentureBeat via email. "This is particularly important when the comments are sarcastic, ironic, or vague."

Specia says that context sensitivity can help address existing problems in toxicity detection, like oversensitivity to African American Vernacular English. For example, comments containing the word "ass" are often flagged toxic by tools like Perspective. But the researchers' models can understand from the context when it's a friendly, harmless conversation.

"These models can certainly be implemented in production. Our results in the paper are based on training on a small dataset of tweets in context, but if the models are trained with sufficient contextual data -- which social media platforms have access to -- I am confident they can achieve much higher accuracies than context-unaware models," Specia added.

Uncertainty

Jigsaw says it's investigating a different concept -- uncertainty -- that incorporates the confidence in the toxicity rating into the model. The company claims that uncertainty can help moderators prioritize their time where it's most needed, especially in areas where the risk for bias or model errors is significant -- such as community-specific language, context-dependent content, and content outside of the online conversation domain.
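One simple way to picture this, and not a description of how Jigsaw implements it, is to treat scores near .5 as uncertain and send those comments to human moderators first. In the sketch below, the `low` and `high` cutoffs, the function name, and the example scores are all hypothetical.

```python
def triage(scored_comments, low=0.3, high=0.7):
    """Split comments into auto-allow, human review, and auto-flag buckets.

    Scores at or below `low` or at or above `high` are treated as confident.
    Everything in between goes to human review, most uncertain first.
    """
    auto_allow, review, auto_flag = [], [], []
    for text, score in scored_comments:
        if score >= high:
            auto_flag.append(text)
        elif score <= low:
            auto_allow.append(text)
        else:
            # Distance from 0.5: smaller means the model is less sure.
            review.append((abs(score - 0.5), text))
    review.sort()
    return auto_allow, [text for _, text in review], auto_flag


# Hypothetical (comment, score) pairs.
scored = [("a", 0.95), ("b", 0.55), ("c", 0.10), ("d", 0.35)]
allow, needs_review, flag = triage(scored)
print(allow)         # ['c']
print(needs_review)  # ['b', 'd']
print(flag)          # ['a']
```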

"We're currently working to improve how we handle uncertainty and ensure that content that may be difficult for the models to handle properly receives low confidence scores. We're [also] exploring how users understand the model's confidence, weighing possibilities like adding more documentation for how to interpret the scores," Vasserman said. "The latest technology of large language models and new serving infrastructure to serve them has improved our modeling abilities in several ways. These models have allowed us to serve many more languages than previously possible and they have reduced bias ... In addition, the specific large language model we have chosen to use is also more resilient to misspellings and intentional misspellings, because it breaks words down into characters and pieces rather than evaluating the word as a whole."

Toward this end, Perspective today rolled out support for 10 new languages: Arabic, Chinese, Czech, Dutch, Indonesian, Japanese, Korean, Polish, Hindi, and Hinglish -- a mix of English and Hindi transliterated using Latin characters. Previously, Perspective was available in English, French, German, Italian, Portuguese, Russian, and Spanish.

Detecting biases through annotation

Research shows that attempting to "debias" biased toxicity detection models is less effective than addressing the root cause of the problem: the training datasets. In a study from the Allen Institute for AI, Carnegie Mellon, and the University of Washington, researchers investigated how differences in dialect can lead to racial biases in automatic hate speech detection models. They found that annotators -- the people responsible for adding labels to the training datasets that serve as examples for the models -- are more likely to label phrases in the African American English (AAE) dialect as more toxic than general American English equivalents, despite their being understood as non-toxic by AAE speakers.

Toxicity detectors are trained on input data -- text -- annotated for a particular output -- "toxic" or "nontoxic" -- until they can detect the underlying relationships between the inputs and output results. During the training phase, the detector is fed with labeled datasets, which tell it which output is related to each specific input value. The learning process progresses by constantly measuring the outputs and fine-tuning the system to get closer to the target accuracy.
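The training loop described above can be sketched with a tiny bag-of-words logistic regression. This is a deliberately small stand-in, not the architecture of any production detector. The examples, labels, function names, and hyperparameters are all hypothetical.

```python
import math
from collections import defaultdict


def featurize(text):
    """Represent a comment as the set of lowercase words it contains."""
    return set(text.lower().split())


def predict(weights, bias, text):
    """Return the model's estimated probability that the text is toxic."""
    z = bias + sum(weights[word] for word in featurize(text))
    return 1.0 / (1.0 + math.exp(-z))


def train(examples, epochs=200, lr=0.5):
    """Fit word weights by repeatedly nudging them toward the labels."""
    weights = defaultdict(float)
    bias = 0.0
    for _ in range(epochs):
        for text, label in examples:
            error = predict(weights, bias, text) - label  # output vs. target
            bias -= lr * error
            for word in featurize(text):
                weights[word] -= lr * error
    return weights, bias


# Hypothetical annotated data: 1 = "toxic", 0 = "nontoxic".
examples = [
    ("you are an idiot", 1),
    ("what an idiot", 1),
    ("shut up idiot", 1),
    ("you are wonderful", 0),
    ("what a wonderful day", 0),
    ("have a nice day", 0),
]

weights, bias = train(examples)
print(predict(weights, bias, "you idiot") > 0.5)      # True
print(predict(weights, bias, "wonderful day") < 0.5)  # True
```

The model learns only what the labels tell it. If annotators systematically mark one dialect's phrases as toxic, the same loop will encode that judgment in the weights.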

Beyond language, the computer vision domain is rife with examples of prejudice arising from biased annotations. Research has found that ImageNet and Open Images -- two large, publicly available image datasets -- are U.S.- and Euro-centric. Models trained on these datasets perform worse on images from Global South countries. For example, images of grooms are classified with lower accuracy when they come from Ethiopia and Pakistan, compared to images of grooms from the United States.

Cognizant of issues that can arise during the dataset labeling process, Jigsaw says that it has conducted experiments to determine how annotators from different backgrounds and experiences classify things according to toxicity. In a study expected to be published in early 2022, researchers at the company found differences in the annotations between labelers who self-identified as African American or as members of the LGBTQ+ community and annotators who didn't identify with either of those two groups.
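Differences like these raise the question of how ratings from several annotator groups should be combined into one training label. The sketch below contrasts two options: averaging across groups, or deferring to the group that a piece of content relates to. The group names, the ratings, and the function names are hypothetical, and this is not Jigsaw's method.

```python
from statistics import mean


def average_across_groups(ratings_by_group):
    """Average the per-group means so a large group does not dominate.

    Raises statistics.StatisticsError if no group has any ratings.
    """
    return mean(mean(r) for r in ratings_by_group.values() if r)


def prefer_related_group(ratings_by_group, related_group):
    """Use only the related group's ratings, falling back to the average."""
    ratings = ratings_by_group.get(related_group)
    if ratings:
        return mean(ratings)
    return average_across_groups(ratings_by_group)


# Hypothetical ratings for one comment: 1 = "toxic", 0 = "nontoxic".
ratings = {"group_x": [0, 0, 1], "other": [1, 1, 1, 0]}

print(round(average_across_groups(ratings), 3))            # 0.542
print(round(prefer_related_group(ratings, "group_x"), 3))  # 0.333
```

With these made-up numbers, the two strategies land on opposite sides of a .5 cutoff, which shows why the choice between them matters.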

"This research is in its early stages, and we're working on expanding into additional identities too," Vasserman said. "We're [also] working on how best to integrate those annotations into Perspective models, as this is non-trivial. For example, we could always average annotations from different groups or we could choose to use annotations from only one group on specific content that might be related to that group. We're still exploring the options here and the different impact of each potential choice."

Acceptable tradeoffs

Jigsaw's Perspective API processes more than 500 million requests daily for media organizations including Vox Media, OpenWeb, and Disqus. Facebook has applied its automatic hate speech detection models to content from the billions of people who use its social networks. And they aren't the only ones.

But as the technology stands today, even models developed with the best of intentions are likely to make mistakes that disproportionately impact disadvantaged groups. How often they make those mistakes -- and the heavy-handedness of the moderation that those mistakes inform -- is ultimately up to the platforms.

Perspective claims to err on the side of transparency, allowing publishers to show readers feedback on the predicted toxicity of their comments and filter conversations based on the level of predicted toxicity. Other platforms, like Facebook, are more opaque about the predictions that their algorithms make.

Hase argues that explainability is increasingly "critical" as language models become more capable -- and are entrusted with more complex tasks. During testimony before the U.S. Congress in April 2018, Facebook CEO Mark Zuckerberg infamously predicted that AI could take a primary role in automatically detecting hate speech on Facebook in the next 5 to 10 years. Leaked documents suggest that the company is far from achieving that goal -- but that it hasn't communicated this to its users.

"Work on making language models explainable is an important part of checking whether models deserve this trust at all," Hase said.