Head over to our on-demand library to view sessions from VB Transform 2023. Register Here

Datasets are a primary driver of progress in computer vision, and many computer vision applications require datasets that include human faces. These datasets often have labels denoting racial identity, expressed as a category assigned to faces. But historically, little attention has been paid to the validity, construction, and stability of these categories. Race is an abstract, fuzzy notion, and highly consistent representations of a racial group across datasets could be indicative of stereotyping.

Northeastern University researchers sought to study these face labels in the context of racial categories and fair AI. In a paper, they argue that labels are unreliable as indicators of identity because some labels are more consistently defined than others and because datasets appear to “systematically” encode stereotypes of racial categories.

Their timely research comes after Deborah Raji and Genevieve Fried published a pivotal study examining facial recognition datasets compiled over 43 years. The authors found that researchers, driven by the exploding data requirements of machine learning, gradually stopped asking for people’s consent, leading them to unintentionally include photos of minors, racist and sexist labels, and images of inconsistent quality and lighting.

Analyzing four datasets (FairFace, BFW, RFW, and LAOFIW), the researchers observed that racial labels are used in computer vision without definition, or with only loose and nebulous definitions. There are many systems of racial classification and terminology, some of debatable coherence, with one dataset grouping together “people with ancestral origins in Sub-Saharan Africa, India, Bangladesh, Bhutan, among others.” Other datasets use labels that are widely considered offensive, like “Mongoloid.”


Moreover, a number of computer vision datasets use the label “Indian/South Asian,” which the researchers point to as an example of the pitfalls of racial categories. If the “Indian” label refers only to the country of India, it’s arbitrary in the sense that the borders of India represent the partitioning of a colonial empire on political grounds. Indeed, racial labels largely correspond with geographic regions, including populations with a range of languages, cultures, separation in space and time, and phenotypes. Labels like “South Asian” should include populations in Northeast India that might exhibit traits more common in East Asia, but ethnic groups span racial lines, and labels can fractionalize them, placing some members in one racial category and others in another.

“The often employed, standard set of racial categories — e.g., ‘Asian,’ ‘Black,’ ‘White,’ ‘South Asian’ — is, at a glance, incapable of representing a substantial number of humans,” the researchers wrote. “It obviously excludes indigenous peoples of the Americas, and it is unclear where the hundreds of millions of people who live in the Near East, Middle East, or North Africa should be placed. One can consider extending the number of racial categories used, but racial categories will always be incapable of expressing multiracial individuals, or racially ambiguous individuals. National origin or ethnic origin can be utilized, but the borders of countries are often the results of historical circumstance and don’t reflect differences in appearance, and many countries are not racially homogeneous.”

Equally problematic, the researchers found that annotators disagreed when labeling faces in the datasets they analyzed. All datasets seemed to include and recognize a very specific type of person as Black — a stereotype — while having more expansive (and less consistent) definitions for other racial categories. Furthermore, the consistency of racial perception varied across ethnic groups: Filipinos in one dataset were less consistently seen as Asian than Koreans, for example.

“It is possible to explain some of the results purely probabilistically — blond hair is relatively uncommon outside of Northern Europe, so blond hair is a strong signal of being from Northern Europe, and thus, belonging to the White category. But if the datasets are biased toward images collected from individuals in the U.S., then East Africans may not be included in the datasets, which results in high disagreement on the racial label to assign to Ethiopians relative to the low disagreement on the Black racial category in general,” the researchers explained.
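The kind of inconsistency the researchers describe can be quantified with a simple inter-annotator disagreement rate. The sketch below is purely illustrative — the data, group names, and metric are hypothetical, not the researchers' actual methodology — but it shows how per-group disagreement can surface the pattern described above, where some categories are labeled far less consistently than others:

```python
def disagreement_rate(label_sets):
    """Fraction of faces whose annotators did not unanimously
    agree on a label. A crude, illustrative consistency metric."""
    disagreements = sum(1 for labels in label_sets if len(set(labels)) > 1)
    return disagreements / len(label_sets)

# Hypothetical annotations: three annotators label each face.
annotations = {
    "group_a": [["Black", "Black", "Black"], ["Black", "Black", "Black"]],
    "group_b": [["Asian", "Asian", "White"], ["Asian", "White", "White"]],
}

for group, label_sets in annotations.items():
    # group_a is unanimously labeled (rate 0.0); group_b is not (rate 1.0)
    print(group, disagreement_rate(label_sets))
```

In practice, researchers typically use chance-corrected agreement statistics (e.g., Fleiss' kappa) rather than a raw disagreement rate, but the underlying comparison across groups is the same.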

These racial labeling biases could be reproduced and amplified if left unaddressed, the researchers warn, and take on false validity with dangerous consequences when divorced from cultural context. Indeed, numerous studies — including the landmark Gender Shades work by Raji, Joy Buolamwini, Timnit Gebru, and Helen Raynham — and VentureBeat’s own analyses of public benchmark data have shown that facial recognition algorithms are susceptible to various biases. One frequent confounder is technology and techniques that favor lighter skin, from sepia-tinged film to low-contrast digital cameras. These prejudices can be encoded in algorithms such that their performance on people with darker skin falls short of their performance on those with lighter skin.

“A dataset can have equal amounts of individuals across racial categories, but exclude ethnicities or individuals who don’t fit into stereotypes,” the researchers wrote. “It is tempting to believe fairness can be purely mathematical and independent of the categories used to construct groups, but measuring the fairness of systems in practice, or understanding the impact of computer vision in relation to the physical world, necessarily requires references to groups which exist in the real world, however loosely.”
