Researchers propose problematic method of using synthetic faces to test algorithms for gender and race bias

In a paper uploaded to the preprint server Arxiv.org, researchers at the Massachusetts Institute of Technology, California Institute of Technology, and Amazon Web Services propose a controversial method for measuring bias in facial analysis algorithms. They claim that when applied to a face database of members of international parliaments, it identified "significant" imbalances across attributes like age, hair length, and facial hair -- but not skin color.

"While I appreciate that the coauthors made an attempt to identify the clear shortcomings with regard to how they treat complex concepts like race and gender, in the end it feels more like a dodge than truly addressing any of the real problems underlying the pursuit of reducing bias in AI," Liz O'Sullivan, cofounder and technology director of the Surveillance Technology Oversight Project, told VentureBeat via email. "Casually attempting to transform images from one race and gender to another strikes me as high-tech blackface, a technique that feels completely tone-deaf given the current environment."

The researchers' conclusion conflicts with a landmark work published by Google AI ethicist Timnit Gebru and AI Now Institute researcher Deborah Raji that found facial analysis systems from Amazon, IBM, Face++, and Microsoft perform best for white men and worst for women with darker skin. A separate 2012 study showed algorithms from vendor Cognitec were 5% to 10% less accurate at recognizing Black people. In 2011, a study suggested facial algorithms developed in China, Japan, and South Korea had difficulty distinguishing between Caucasian and East Asian faces. And more recently, MIT Media Lab researchers found that Microsoft, IBM, and Megvii facial analysis software misidentified gender in up to 7% of lighter-skinned females, up to 12% of darker-skinned males, and up to 35% of darker-skinned females.

Why the disparity? Unlike previous work, this new approach relies on images generated by Nvidia's StyleGAN2 instead of photos of real people. It alters multiple facial attributes at a time to produce samples of test images called "transects," which the coauthors quantify with detailed annotations that can be compared with a facial analysis algorithm's predictions in order to measure bias.

The researchers argue transects allow them to predict bias in "new scenarios" while "greatly reducing" ethical and legal challenges. But critics like O'Sullivan take issue with any attempt to improve a technology that could victimize those it identifies. "This research seeks to make facial recognition work better on dark faces, which will in turn be used disproportionately to surveil and incarcerate people with darker skin," she said. "Data bias is only one facet of the problems that exist with facial recognition technology."

Human annotators were recruited through Amazon Mechanical Turk and told to evaluate StyleGAN2-generated images for seven attributes (gender, facial hair, skin color, age, makeup, smiling, and hair length) as well as image fakeness. (Images with a fakeness score above a certain threshold were removed.) Five annotations were collected for each of the eight ratings, for a total of 40 annotations per image, and the researchers report the standard deviation for most was low (near 0.1), indicating "good agreement" between the annotators.
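To make the arithmetic concrete, here is a minimal Python sketch of how per-image annotations of this kind could be summarized. It is an illustration, not the researchers' code: the 0-to-1 rating scale, the 0.5 fakeness cutoff, the use of population standard deviation, and every name in it (`RATED_ITEMS`, `summarize_image`, `example`) are assumptions, since the article only says "a certain threshold" was used.

```python
from statistics import mean, pstdev

# The eight rated items named in the article: seven attributes plus fakeness.
RATED_ITEMS = [
    "gender", "facial_hair", "skin_color", "age",
    "makeup", "smiling", "hair_length", "fakeness",
]

# Hypothetical cutoff; the actual threshold is not given in the article.
FAKENESS_THRESHOLD = 0.5


def summarize_image(ratings, threshold=FAKENESS_THRESHOLD):
    """Summarize one image's ratings.

    ratings maps each rated item to a list of annotator scores
    (assumed here to lie on a 0-to-1 scale).
    Returns (per-item mean and std, total annotation count, keep flag).
    """
    summary = {
        item: {"mean": mean(ratings[item]), "std": pstdev(ratings[item])}
        for item in RATED_ITEMS
    }
    total = sum(len(ratings[item]) for item in RATED_ITEMS)
    # Images whose mean fakeness exceeds the threshold are dropped.
    keep = summary["fakeness"]["mean"] <= threshold
    return summary, total, keep


# Made-up scores from five annotators per item.
example = {item: [0.5, 0.5, 0.75, 0.5, 0.5] for item in RATED_ITEMS}
example["fakeness"] = [0.0, 0.25, 0.0, 0.0, 0.0]

summary, total, keep = summarize_image(example)
print(total, keep)  # 8 items x 5 annotators = 40 annotations; image is kept
```

A low standard deviation for an item means the five annotators gave similar scores, which is what the researchers describe as "good agreement."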

To test their method, the coauthors used the Pilot Parliaments Benchmark, a database of faces of parliament members from nations around the world created with the goal of balancing gender and skin color. They trained two "research-grade" gender classifier models -- one on the publicly available CelebA data set of celebrity faces and the other on FairFace, a face corpus balanced for race, gender, and age -- and used a pretrained StyleGAN2 model to synthesize faces for transects.

After analyzing 5,335 images from the Pilot Parliaments Benchmark for bias, the researchers found skin color to be "not significant" in determining the classifiers' predictive bias. If hair length wasn't controlled for, a bias toward assigning gender on the basis of hair length might be read as a bias concerning dark-skinned women, they said. But in their estimation, skin color had a "negligible" effect compared with a person's facial hair, gender, makeup, hair length, and age.
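The confounding argument can be illustrated with a toy calculation. The Python sketch below uses invented counts, not figures from the paper, for a hypothetical set of women's faces in which the classifier errs more often on short hair and darker-skinned women happen to have short hair more often. The names `COUNTS` and `error_rate_by` are made up for this example.

```python
from collections import defaultdict

# Invented counts for illustration only: (skin, hair) -> (faces, errors).
# Within each hair length, the error rate is identical for both skin groups.
COUNTS = {
    ("lighter", "long"): (80, 4),    # 5% error
    ("lighter", "short"): (20, 6),   # 30% error
    ("darker", "long"): (20, 1),     # 5% error
    ("darker", "short"): (80, 24),   # 30% error
}


def error_rate_by(key_fn):
    """Pool the counts by key_fn(skin, hair) and return error rates."""
    totals = defaultdict(lambda: [0, 0])  # key -> [faces, errors]
    for (skin, hair), (faces, errors) in COUNTS.items():
        key = key_fn(skin, hair)
        totals[key][0] += faces
        totals[key][1] += errors
    return {key: errors / faces for key, (faces, errors) in totals.items()}


# Ignoring hair length: darker-skinned women appear to fare much worse.
by_skin = error_rate_by(lambda skin, hair: skin)
print(by_skin)

# Controlling for hair length: the gap between skin groups disappears.
by_skin_and_hair = error_rate_by(lambda skin, hair: (skin, hair))
print(by_skin_and_hair)
```

In this made-up data, the pooled error rate is 10% for lighter-skinned women and 25% for darker-skinned women, yet within each hair length the two groups have the same rate. That is the pattern the researchers say they observed, though critics dispute whether synthetic faces can support the conclusion.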

The researchers admit to flaws in their technique, however. StyleGAN2 often adds facial hair to male faces when it increases hair length on a transect, potentially resulting in lower classifier error rates for males with longer hair. It also tends to add earrings to a transect when it's modifying a dark-skinned face to look female; depending on culture, earrings could affect the presumption of a person's gender. Moreover, many of the generated faces contained visible artifacts that could have affected the classifiers' predictions, which the annotators attempted to eliminate but might have missed.

"The GAN used to create the fake images is guaranteed also to be biased," O'Sullivan said. "You can’t sidestep that question by saying 'our model is only trying to predict human perception of race.' Race is more than just skin color. Whatever data biases exist in the GAN will be transferred into the analysis of new models. If the GAN is creating new faces based on training data of mainly White people, then other facial features that may be more common on Black faces (aside from skin color) will fail to be represented in the fake faces that the GAN generates, meaning the analysis of bias may not be generalizable to the population of real Black people in the world."

Others believe there's a limit to the extent that bias can be mitigated in facial analysis and recognition technologies. (Indeed, facial recognition programs can be wildly inaccurate; in one case, a system may have misidentified people upwards of 96% of the time.) The Association for Computing Machinery (ACM) and American Civil Liberties Union (ACLU) continue to call for moratoriums on all forms of the technology. San Francisco and Oakland in California, as well as Boston and five other Massachusetts communities, have banned police use of facial recognition technology. And after the first wave of recent Black Lives Matter protests in the U.S., companies including Amazon, IBM, and Microsoft halted or ended the sale of facial recognition products.