Researchers claim bias in AI named entity recognition models

Twitter researchers claim to have found evidence of demographic bias in named entity recognition, the first step toward generating automated knowledge bases, the repositories leveraged by services like search engines. They say their analysis reveals that the models perform better at identifying names from specific groups, and that the biases manifest in syntax, semantics, and how word usage varies across linguistic contexts.

Knowledge bases are essentially databases containing information about entities -- people, places, and things. In 2012, Google launched a knowledge base, the Knowledge Graph, to enhance search results with hundreds of billions of facts gathered from sources including Wikipedia, Wikidata, and CIA World Factbook. Microsoft provides a knowledge base with over 150,000 articles created by support professionals who have resolved issues for its customers. But while the usefulness of knowledge bases is not in dispute, the researchers assert the embeddings used to represent entities in them exhibit bias against certain groups of people.

To show and quantify this bias, the coauthors evaluated popular named entity recognition models and off-the-shelf models from commonly used natural language processing libraries, including GloVe, CNET, ELMo, spaCy, and StanfordNLP, on a synthetically generated test corpus. They ran inference with each model on the test data set to extract people's names, then measured the accuracy and confidence for the correctly extracted names, repeating the experiment with and without capitalization of the names.

The name collection consisted of 123 names across eight different racial, ethnic, and gender groups (e.g., Black, white, Hispanic, Muslim, male, female). Each demographic was represented in the collection by upwards of 15 "salient" names, coming from popular names registered in Massachusetts between 1974 and 1979 (which have historically been used to study algorithmic bias) and from the ConceptNet project, a semantic network designed to help algorithms understand the meanings of words. The researchers used these to generate over 217 million synthetic sentences with templates from the Winogender Schemas project (which was originally designed to identify gender bias in automated systems), combined with 289 sentences from a "more realistic" data set for added robustness.
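
The template-based generation described above can be sketched roughly as follows. This is a minimal illustration, not the researchers' code: the group labels, names, and templates are hypothetical placeholders rather than the study's actual data, and the `lowercase` flag is an assumed way of producing the uncapitalized variant of the experiment.

```python
from itertools import product

# Hypothetical placeholders: NOT the names, groups, or templates used in the study.
NAMES_BY_GROUP = {
    "group_a": ["Alice", "Maria"],
    "group_b": ["Omar", "James"],
}
TEMPLATES = [
    "{name} asked the technician for help.",
    "The nurse told {name} to wait outside.",
]

def generate_sentences(names_by_group, templates, lowercase=False):
    """Yield (group, name, sentence) for every name/template pair.

    With lowercase=True the name is inserted without capitalization,
    mirroring the "without capitalization" run of the experiment.
    """
    for group, names in names_by_group.items():
        for name, template in product(names, templates):
            surface = name.lower() if lowercase else name
            yield group, surface, template.format(name=surface)

corpus = list(generate_sentences(NAMES_BY_GROUP, TEMPLATES))
corpus_lower = list(generate_sentences(NAMES_BY_GROUP, TEMPLATES, lowercase=True))
```

Because every name is slotted into every template, the corpus size is simply the number of names multiplied by the number of templates, so each group is tested on identical sentence contexts.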

The results of the experiment show accuracy was highest on male and female white names across all models except ELMo, which extracted Muslim male names with the highest accuracy, and that a larger percentage of white names had higher model confidences compared with non-white names. For example, while GloVe was only 81% accurate for Muslim female names, it was 89% accurate for white female names. CNET was only 70% accurate for Black female names, but 96% accurate for white male names.
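
Per-group accuracy figures like these could be computed along the following lines. This is only a sketch under stated assumptions: `toy_extractor` is a hypothetical stand-in for a real named entity recognition model (it simply treats capitalized tokens as names), and the example sentences and group labels are invented for illustration.

```python
from collections import defaultdict

def toy_extractor(sentence):
    """Hypothetical stand-in for a NER model: treats capitalized tokens as names."""
    tokens = [t.strip(".,") for t in sentence.split()]
    return {t for t in tokens if t[:1].isupper()}

def accuracy_by_group(examples, extract):
    """Return, per group, the fraction of sentences whose name was extracted.

    examples: iterable of (group, name, sentence) tuples.
    extract:  callable mapping a sentence to a set of extracted names.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for group, name, sentence in examples:
        totals[group] += 1
        if name in extract(sentence):
            hits[group] += 1
    return {group: hits[group] / totals[group] for group in totals}

# Invented examples; the second one uses an uncapitalized name.
examples = [
    ("group_a", "Alice", "Alice asked the technician for help."),
    ("group_a", "alice", "alice asked the technician for help."),
    ("group_b", "Omar", "The nurse told Omar to wait outside."),
    ("group_b", "Omar", "Omar called the office."),
]
scores = accuracy_by_group(examples, toy_extractor)
print(scores)
```

Swapping `toy_extractor` for a wrapper around an actual model would leave the rest of the comparison unchanged, since the function only needs a callable that returns the set of names found in a sentence.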

The researchers say the performance gap is partially attributable to bias in the training data, which contains "significantly" more male names than female names and more white names than non-white names. But they also argue the work sheds light on the uneven accuracy of named entity recognition systems across name categories like gender and race, which they further claim is important because named entity recognition supports not only knowledge bases but also question-answering systems and search result ranking.

"We are aware that our work is limited by the availability of names from various demographics and we acknowledge that individuals will not necessarily identify themselves with the demographics attached to their first name, as done in this work ... However, if named entities from certain parts of the populations are systematically misidentified or mislabeled, the damage will be twofold: they will not be able to benefit from online exposure as much as they would have if they belonged to a different category and they will be less likely to be included in future iterations of training data, therefore perpetuating the vicious cycle," the researchers wrote. "While a lot of research in bias has focused on just one aspect of demographics (i.e. only race or only gender) our work focuses on the intersectionality of both these factors ... Our work can be extended to other named entity categories like location, and organizations from different countries so as to assess the bias in identifying these entities."

In future work, the researchers plan to investigate whether models trained in other languages also show favoritism toward named entities more likely to be used in cultures where that language is popular. They believe that this could lead to an assessment of named entity recognition models in different languages, with named entities ideally representing a larger demographic diversity.
