Researchers claim the data sets often used to train AI systems to detect expressions like happiness, anger, and surprise are biased against certain demographic groups. In a preprint study published on Arxiv.org, coauthors affiliated with the University of Cambridge and Middle East Technical University find evidence of skew in two open source corpora: Real-world Affective Faces Database (RAF-DB) and CelebA.

Machine learning algorithms become biased in part because they’re provided training samples that optimize their objectives toward majority groups. Unless explicitly modified, they perform worse for minority groups, that is, people represented by fewer samples. In domains like facial expression classification, it’s difficult to compensate for skew because the training sets rarely contain information about attributes like race, gender, and age. And even in the sets that do provide such attributes, the attributes are typically unevenly distributed.
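One common way to explicitly modify training so that smaller groups are not drowned out is to weight each sample inversely to the size of its group. The sketch below is only an illustration of that general idea, not the method used in the study; the function name `inverse_frequency_weights` and the group labels are hypothetical.

```python
from collections import Counter


def inverse_frequency_weights(groups):
    """Return one weight per sample, inversely proportional to its group's size.

    Hypothetical helper for illustration; not taken from the study.
    """
    counts = Counter(groups)
    n_groups = len(counts)
    total = len(groups)
    # Each group's weights sum to total / n_groups, so every group
    # contributes equally to a weighted training loss.
    return [total / (n_groups * counts[g]) for g in groups]


# Toy example with made-up group labels: 6, 3, and 1 samples respectively.
groups = ["a"] * 6 + ["b"] * 3 + ["c"] * 1
weights = inverse_frequency_weights(groups)
```

With these toy labels, the single sample from the smallest group receives the largest weight, while the weights still sum to the number of samples.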

RAF-DB contains tens of thousands of images from the internet with facial expression and attribute annotations, while CelebA has 202,599 images of 10,177 people with 40 types of attribute annotations. To determine the extent to which bias existed in either, the researchers sampled a random subset of each and aligned and cropped the images so the faces were consistent with respect to orientation. Then, they used classifiers to measure accuracy (the fraction of predictions the model got correct) and fairness (whether the classifier performed comparably across attributes like gender, age, and ethnicity). The idea is that a classifier should provide similar results across different demographic groups.
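To make the two measurements concrete, here is a minimal sketch that computes accuracy per demographic group and then summarizes fairness as the ratio of the worst group accuracy to the best. That ratio is an assumption chosen for illustration and may differ from the exact fairness metric in the study; the function names and the toy labels are hypothetical.

```python
from collections import defaultdict


def group_accuracies(y_true, y_pred, groups):
    """Fraction of correct predictions, computed separately for each group."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, pred, group in zip(y_true, y_pred, groups):
        total[group] += 1
        correct[group] += int(truth == pred)
    return {g: correct[g] / total[g] for g in total}


def fairness_ratio(accuracies):
    """Worst group accuracy divided by the best (assumed metric, for illustration).

    A value of 1.0 means every group is classified equally well.
    """
    values = list(accuracies.values())
    best = max(values)
    return min(values) / best if best > 0 else 1.0


# Toy data, invented for this example.
y_true = ["happy", "sad", "happy", "sad", "happy", "sad"]
y_pred = ["happy", "sad", "happy", "happy", "sad", "sad"]
groups = ["g1", "g1", "g1", "g1", "g2", "g2"]

acc = group_accuracies(y_true, y_pred, groups)
fair = fairness_ratio(acc)
```

In this toy case the classifier is right on three of four `g1` samples and one of two `g2` samples, so the ratio falls well below 1.0 even though overall accuracy looks moderate.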

In the subset of images from RAF-DB, the researchers report that the vast majority of subjects, 77.4%, were Caucasian, while 15.5% were Asian and only 7.1% were African American. The subset showed gender skew as well, with 56.3% female and 43.7% male subjects. Accuracy unsurprisingly ranged from low for some minority groups (59.1% for Asian females and 61.6% for African American females) to high for majorities (65.3% for Caucasian males). On the fairness metric, the researchers found the classifier scored low for race (88.1%) but high overall for gender (97.3%).

On the CelebA subset, the researchers trained a simpler classifier to distinguish between two classes of people: smiling and non-smiling. They note that the data set had substantial skew, with males accounting for only 38.6% of the non-smiling images compared with 61.4% for females. As a result, the classifier was 93.7% accurate for younger females but less so for older males (90.7%) and females (92.1%). The difference is not statistically significant, but according to the researchers it is an indication of poor distribution.

“To date, there exists a large variety and number of data sets for facial expression recognition tasks. However, virtually none of these data sets have been acquired with consideration of containing images and videos that are evenly distributed across the human population in terms of sensitive attributes such as gender, age and ethnicity,” the coauthors wrote.

The evident bias in facial expression data sets underlines the need for regulation, many would argue. At least one AI startup specializing in affect recognition — Emteq — has called for laws to prevent misuse of the tech. A study commissioned by the Association for Psychological Science noted that because emotions are expressed in a range of ways, it’s hard to infer how someone feels from their expressions. And the AI Now Institute, a research institute based at New York University studying AI’s impact on society, warned in a 2019 report that facial expression classifiers were being unethically used to make hiring decisions and set insurance prices.

“At the same time as these technologies are being rolled out, large numbers of studies are showing that there is … no substantial evidence that people have this consistent relationship between the emotion that you are feeling and the way that your face looks,” AI Now cofounder Kate Crawford told the BBC in a recent interview.