AI displays bias and inflexibility in civility detection, study finds

According to a 2019 Pew Research Center survey, the majority of respondents believe the tone and nature of political debate in the U.S. have become more negative and less respectful. This observation has motivated scientists to study the civility, or lack thereof, in political discourse, particularly on broadcast television. Given their ability to parse language at scale, one might assume that AI and machine learning systems could aid in these efforts. But researchers at the University of Pennsylvania find that at least one tool, Jigsaw's Perspective API, isn't up to the task.

Incivility is subtler and more nuanced than toxicity, which covers identity slurs, profanity, and threats of violence. While incivility detection is a well-established task in AI, it isn't well standardized: the degree and type of incivility vary across datasets.

The researchers studied Perspective -- an AI-powered API for content moderation developed by Jigsaw, the organization working under Google parent company Alphabet to tackle cyberbullying and disinformation -- in part because of its widespread use. Media organizations including the New York Times, Vox Media, OpenWeb, and Disqus have adopted it, and it's now processing 500 million requests daily.

To benchmark Perspective's ability to spot incivility, the researchers built a corpus containing 51 transcripts from PBS NewsHour, MSNBC's The Rachel Maddow Show, and Hannity from Fox News. Annotators read through each transcript and identified segments that appeared to be especially uncivil or civil, rating them on a ten-point scale for measures like "polite/rude," "friendly/hostile," "cooperative/quarrelsome," and "calm/agitated." Scores and selections across annotators were combined to yield a single civility score for each snippet between 1 and 10, where 1 is the most civil and 10 is the least civil possible.
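The article does not spell out how the annotator ratings were combined. As a minimal sketch, assuming a plain average over the four quoted dimensions and then over annotators (the function name, the data layout, and the averaging rule are all assumptions for illustration, not the researchers' method), the scoring could look like this:

```python
from statistics import mean

# Rating dimensions quoted in the article; the full list used in the study may differ.
DIMENSIONS = (
    "polite/rude",
    "friendly/hostile",
    "cooperative/quarrelsome",
    "calm/agitated",
)


def composite_civility_score(ratings):
    """Combine per-annotator ratings into one score (1 = most civil, 10 = least).

    `ratings` is a list with one dict per annotator, mapping each dimension
    name to a rating from 1 to 10. Averaging is an assumed combining rule.
    """
    per_annotator = []
    for annotator in ratings:
        values = [annotator[d] for d in DIMENSIONS]
        if any(not 1 <= v <= 10 for v in values):
            raise ValueError("ratings must be between 1 and 10")
        # First average the dimensions for this annotator...
        per_annotator.append(mean(values))
    # ...then average across annotators.
    return mean(per_annotator)
```

Averaging is only one option; a study might instead use medians or weight annotators by reliability.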

After running the annotated transcript snippets through the Perspective API, the researchers found that the API wasn't sensitive enough to detect differences in levels of incivility for ratings lower than six. Perspective scores increased for higher levels of incivility, but annotator and Perspective incivility scores agreed only 51% of the time.
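For readers curious what such a comparison involves, here is a rough sketch. The request and response shapes follow Perspective's public comments:analyze method; the choice of the TOXICITY attribute, both cutoffs, and the definition of "agreement" as matching binary calls are assumptions, since the article does not say how the 51% figure was computed.

```python
import json
import urllib.request

ANALYZE_URL = "https://commentanalyzer.googleapis.com/v1alpha1/comments:analyze"


def build_request(text):
    """Build the JSON body for one snippet (TOXICITY is an assumed attribute)."""
    return {
        "comment": {"text": text},
        "languages": ["en"],
        "requestedAttributes": {"TOXICITY": {}},
    }


def parse_score(response):
    """Pull the summary score (0.0 to 1.0) out of an analyze response."""
    return response["attributeScores"]["TOXICITY"]["summaryScore"]["value"]


def score_snippet(text, api_key):
    """Send one snippet to the API; needs network access and a valid key."""
    request = urllib.request.Request(
        f"{ANALYZE_URL}?key={api_key}",
        data=json.dumps(build_request(text)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as resp:
        return parse_score(json.load(resp))


def agreement_rate(human_scores, api_scores, human_cutoff=6, api_cutoff=0.5):
    """Fraction of snippets where humans and the API make the same binary call.

    Both cutoffs are hypothetical: a human score at or above `human_cutoff`
    (1-10 scale) and an API score at or above `api_cutoff` count as "uncivil".
    """
    if len(human_scores) != len(api_scores) or not human_scores:
        raise ValueError("need two non-empty lists of equal length")
    matches = sum(
        (h >= human_cutoff) == (a >= api_cutoff)
        for h, a in zip(human_scores, api_scores)
    )
    return matches / len(human_scores)
```

With these assumed cutoffs, `agreement_rate([2, 8, 7, 3], [0.1, 0.9, 0.2, 0.7])` counts two matching calls out of four snippets.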

"Overall, for broadcast news, Perspective cannot reproduce the incivility perception of people," the researchers write. "In addition to the inability to detect sarcasm and snark, there seems to be a problem with over-prediction of the incivility in PBS and FOX [programming]."

In a subsequent test, the researchers sampled words from each transcript, gathering a total of 2,671, which they fed to Perspective one at a time to predict incivility. The results show a problematic trend: Perspective tends to label certain identity terms -- including "gay," "African-American," "Muslim" and "Islam," "Jew," "women," and "feminism" and "feminist" -- as toxic. Moreover, the API erroneously flags words relating to violence, death, and sex (e.g., "die," "kill," "shooting," "prostitution," "pornography," "sexual") even in the absence of incivility, as well as words that in one context could be toxic but in another could refer to a name (e.g., "Dick").
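A word-level probe of this kind is simple to sketch. The helper below is hypothetical: it accepts any scoring function that returns a value between 0 and 1 (for example, a wrapper around the Perspective API) and reports which isolated words cross an assumed threshold of 0.5. The stub scores in the usage lines are made up for illustration and are not real Perspective output.

```python
def flag_words(words, score_fn, threshold=0.5):
    """Score each word in isolation and return those at or above `threshold`.

    `score_fn` is any callable mapping a string to a score in [0, 1].
    The threshold is an assumption, not a value taken from the study.
    Results are ordered from highest to lowest score.
    """
    flagged = {}
    for word in words:
        score = score_fn(word)
        if score >= threshold:
            flagged[word] = score
    return dict(sorted(flagged.items(), key=lambda item: item[1], reverse=True))


# Usage with a stub scorer; these numbers are invented placeholders.
stub_scores = {"table": 0.02, "kill": 0.71, "debate": 0.05, "die": 0.64}
flagged = flag_words(stub_scores, stub_scores.get)
```

Because each word is scored without context, anything flagged here reflects an association the model has learned with the word itself rather than with how it is used in a sentence.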

Other auditors have claimed that Perspective doesn't moderate hate and toxic speech equally across groups of people. A study published by researchers at the University of Oxford, the Alan Turing Institute, Utrecht University, and the University of Sheffield found that the Perspective API particularly struggles with denouncements of hate that quote others' hate speech or make direct references to it. An earlier University of Washington study published in 2019 found that Perspective was more likely to label "Black-aligned English" offensive versus "white-aligned English."

For its part, Jigsaw recently told VentureBeat that it has made and continues to make progress toward mitigating the biases in its models.

The researchers say that their work highlights the shortcomings of AI when applied to the task of civility detection. While they believe that prejudices against groups like Muslims and African Americans can be lessened through "data-driven" techniques, they expect that correctly classifying edge cases like sarcasm will require the development of new systems.

"The work we presented was motivated by the desire to apply off-the-shelf methods for toxicity prediction to analyse civility in American news. These methods were developed to detect rude, disrespectful, or unreasonable comment that is likely to make you leave the discussion in an online forum," the coauthors wrote. "We find that Perspective's inability to differentiate levels of incivility is partly due to the spurious correlations it has formed between certain non-offensive words and incivility. Many of these words are identity-related. Our work will facilitate future research efforts on debiasing of automated predictions."