Facebook dataset combats AI bias by having people self-identify age and gender

Facebook today open-sourced a dataset designed to surface age, gender, and skin tone biases in computer vision and audio machine learning models. The company claims that the corpus -- Casual Conversations -- is the first of its kind to feature paid participants who explicitly provided their own age and gender, as opposed to having this information labeled by third parties or estimated using models.

Biases can make their way into the data used to train AI systems, amplifying stereotypes and leading to harmful consequences. Research has shown that state-of-the-art image-classifying AI models trained on ImageNet, a popular dataset containing photos scraped from the internet, automatically learn humanlike biases about race, gender, weight, and more. Numerous studies have demonstrated that facial recognition is susceptible to bias. It's even been shown that prejudices can creep into the AI tools used to create art, potentially contributing to false perceptions about social, cultural, and political aspects of the past and hindering awareness of important historical events.

Casual Conversations, which contains over 4,100 videos of 3,000 participants, some of them from the Deepfake Detection Challenge, aims to combat this bias by including labels of "apparent" skin tone. Facebook says that the tones are estimated using the Fitzpatrick scale, a classification schema for skin color developed in 1975 by American dermatologist Thomas B. Fitzpatrick. The Fitzpatrick scale roughly estimates how different skin types respond to ultraviolet light, ranging from Type I (pale skin that always burns and never tans) to Type VI (deeply pigmented skin that never burns).

Facebook says that it recruited trained annotators for Casual Conversations to determine which skin type each participant had. The annotators also labeled videos with ambient lighting conditions, which can help measure how models treat people with different skin tones in low-light conditions.

A Facebook spokesperson told VentureBeat via email that a U.S. vendor was hired to select annotators for the project from "a range of backgrounds, ethnicity, and genders." The participants -- who hailed from Atlanta, Houston, Miami, New Orleans, and Richmond -- were paid.

"As a field, industry and academic experts alike are still in the early days of understanding fairness and bias when it comes to AI ... The AI research community can use Casual Conversations as one important stepping stone toward normalizing subgroup measurement and fairness research," Facebook wrote in a blog post. "With Casual Conversations, we hope to spur further research in this important, emerging field."

In support of Facebook's point, there's a body of evidence that computer vision models in particular are susceptible to harmful, pervasive prejudice. A paper last fall by University of Colorado, Boulder researchers demonstrated that AI from Amazon, Clarifai, Microsoft, and others maintained accuracy rates above 95% for cisgender men and women but misidentified trans men as women 38% of the time. Independent benchmarks of major vendors' systems by the Gender Shades project and the National Institute of Standards and Technology (NIST) have demonstrated that facial recognition technology exhibits racial and gender bias and have suggested that current facial recognition programs can be wildly inaccurate, misclassifying people upwards of 96% of the time.

Beyond facial recognition, features like Zoom's virtual backgrounds and Twitter's automatic photo-cropping tool have historically disfavored people with darker skin. Back in 2015, a software engineer pointed out that the image recognition algorithms in Google Photos were labeling his Black friends as "gorillas." And nonprofit AlgorithmWatch showed that Google's Cloud Vision API at one time automatically labeled a thermometer held by a dark-skinned person as a "gun" while labeling a thermometer held by a light-skinned person as an "electronic device."

Experts attribute many of these errors to flaws in the datasets used to train the models. One recent MIT-led audit of popular machine learning datasets found an average of 3.4% annotation errors, including one where a picture of a Chihuahua was labeled "feather boa." An earlier version of ImageNet, a dataset used to train AI systems around the world, was found to contain photos of naked children, porn actresses, college parties, and more -- all scraped from the web without those individuals' consent. Another computer vision corpus, 80 Million Tiny Images, was found to have a range of racist, sexist, and otherwise offensive annotations, such as nearly 2,000 images labeled with the N-word, and labels like "rape suspect" and "child molester."

But Casual Conversations is far from a perfect benchmark. Facebook says it didn't collect information about where the participants are originally from. And when asking participants their gender, the company provided only the choices "male," "female," and "other," leaving out more specific options for people who identify as, for example, nonbinary.

The spokesperson also clarified that Casual Conversations is available to Facebook teams only as of today and that employees won't be required -- but will be encouraged -- to use it for evaluation purposes.

Exposés about Facebook's approaches to fairness haven't done much to engender trust within the AI community. A New York University study published in July 2020 estimated that Facebook's machine learning systems make about 300,000 content moderation mistakes per day, and problematic posts continue to slip through Facebook's filters. In one Facebook group that was created last November and rapidly grew to nearly 400,000 people, members calling for a nationwide recount of the 2020 U.S. presidential election swapped unfounded accusations about alleged election fraud and state vote counts every few seconds.

For its part, Facebook says that while it considers Casual Conversations a "good, bold" first step, it'll continue pushing toward techniques that capture greater diversity. "In the next year or so, we hope to explore pathways to expand this data set to be even more inclusive with representations that include more geographical locations, activities, and a wider range of gender identities and ages," the spokesperson said. "It’s too soon to comment on future stakeholder participation, but we’re certainly open to speaking with stakeholders in the tech industry, academia, researchers, and others."
