Training AI algorithms on mostly smiling faces reduces accuracy and introduces bias, according to research

Facial recognition systems are problematic for a number of reasons, not least because they tend to exhibit prejudice against certain demographic groups and genders. But a new study from researchers affiliated with MIT, the Universitat Oberta de Catalunya in Barcelona, and the Universidad Autonoma de Madrid explores another problematic aspect that's received less attention so far: bias toward certain facial expressions. The coauthors claim that expressions affect facial recognition systems "at least" as much as wearing a scarf, hat, wig, or glasses does, and that the datasets used to train these systems are highly biased in this regard.

The study adds to a growing body of evidence that facial recognition is susceptible to harmful, pervasive prejudice. A paper last fall by University of Colorado, Boulder researchers demonstrated that AI from Amazon, Clarifai, Microsoft, and others maintained accuracy rates above 95% for cisgender men and women but misidentified trans men as women 38% of the time. Independent benchmarks of major vendors' systems by the Gender Shades project and the National Institute of Standards and Technology (NIST) have demonstrated that facial recognition technology exhibits racial and gender bias and have suggested that current facial recognition programs can be wildly inaccurate, misclassifying people upwards of 96% of the time.

In the course of their research, the coauthors carried out experiments using three different leading facial recognition models trained on open source databases including VGGFace2 (a dataset spanning over 3 million images of more than 9,100 people) and MS1M-ArcFace (which has over 5.8 million images of 85,000 people). They benchmarked them against four corpora, specifically:

The Compound Facial Expression of Emotion, which contains photographs of 230 people captured in a lab-controlled environment.
The Extended Cohn-Kanade (CK+), one of the most widely used databases for training and evaluating face expression recognition systems, with 593 sequences of photos of 123 people.
CelebA, a large-scale face attribute dataset comprising 200,000 images of 10,000 celebrities.
MS-Celeb-1M, a publicly available face recognition benchmark and dataset released in 2016 by Microsoft containing nearly 10 million images of 1 million celebrities.

As the researchers note, academics and corporations have long scraped facial photographs from sources like the web, movies, and social media to address the problem of model training data scarcity. Like most machine learning models, facial recognition models require large amounts of data to achieve a baseline level of accuracy. But these sources of data are generally unbalanced, as it turns out, because some facial expressions are less common than others. For example, people tend to share more happy faces than sad ones on Facebook, Twitter, and Instagram.

To classify images from their four benchmark corpora by expression, the researchers used software from Affectiva that recognizes up to 7 facial expressions: 6 basic emotions plus a neutral face. They found that the proportion of "neutral" images exceeded 60% across all datasets, reaching 83.7% in MS-Celeb-1M. The second-most common facial expression was "happy"; across all the datasets, around 90% of the images showed a person with either a "neutral" or a "happy" expression. As for the other 5 facial expressions, "surprised" and "disgusted" rarely exceeded 6%, while "sad," "fear," and "anger" had very low representation (often below 1%).
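As a rough illustration of how such a distribution can be tallied, here is a minimal Python sketch. It assumes each image has already been assigned a single expression label by some classifier. The function names and the toy labels are hypothetical and are not taken from the study or from Affectiva's software.

```python
from collections import Counter


def expression_proportions(labels):
    """Return each expression label's share of the total number of images."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: count / total for label, count in counts.items()}


def mainstream_share(proportions, mainstream=("neutral", "happy")):
    """Sum the shares of the expressions treated as 'mainstream'."""
    return sum(proportions.get(label, 0.0) for label in mainstream)


# Toy labels chosen by hand for illustration only (not data from the study).
toy_labels = ["neutral"] * 6 + ["happy"] * 3 + ["sad"]

props = expression_proportions(toy_labels)
for label, share in sorted(props.items(), key=lambda item: -item[1]):
    print(f"{label}: {share:.1%}")
print(f"neutral + happy: {mainstream_share(props):.1%}")
```

Running the same tally separately for each gender would show the kind of imbalance the researchers describe.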

The results varied by gender, too. In VGGFace2, the number of "happy" women was almost twice the number of "happy" men.

"This remarkable under-representation of some facial expressions in the datasets produces ... drawbacks," the coauthors wrote in a paper describing their work. "On the one hand, models are trained using highly biased data that result in heterogeneous performances. On the other hand, technology is evaluated only for mainstream expressions hiding its real performance for images with some specific facial expressions ... [In addition, the] gender bias is important because it might cause different performances for both genders."

The researchers next conducted an analysis to determine the extent to which the facial expression biases in example sets like CelebA might have an impact on the predictions of facial recognition systems. Across all three of the aforementioned algorithms, performance was better on faces showing "neutral" or "happy" expressions, the most common expressions in the training databases.

The study's findings suggest that differences in facial expressions can't fool systems into misidentifying a person as someone else. However, they also imply that facial expression biases cause a system's "genuine" comparison scores (the scores produced when an algorithm compares images of the same face) to vary by upwards of 40%.
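To make the idea of a "genuine" comparison score concrete, the following Python sketch compares two embeddings of the same person and measures how far the score falls when the expression changes. It assumes cosine similarity as the comparison function, which the article does not confirm the researchers used. The embeddings are two-dimensional toy vectors chosen by hand, and every name here is hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def relative_drop(baseline_score, other_score):
    """Fraction by which other_score falls below baseline_score."""
    return (baseline_score - other_score) / baseline_score


# Toy embeddings of ONE person (illustrative values, not real model output).
neutral_reference = [1.0, 0.0]   # enrolled image, neutral expression
neutral_probe = [1.0, 0.0]       # new image, neutral expression
other_probe = [0.8, 0.6]         # new image, a less common expression

baseline = cosine_similarity(neutral_reference, neutral_probe)
shifted = cosine_similarity(neutral_reference, other_probe)

print(f"genuine score, neutral vs. neutral: {baseline:.2f}")
print(f"genuine score, neutral vs. other:   {shifted:.2f}")
print(f"relative drop: {relative_drop(baseline, shifted):.0%}")
```

In a deployed system, a genuine score that drops below the match threshold means the person is not recognized, even though they are not mistaken for anyone else.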

The researchers used only Affectiva's software to classify emotions, which might have introduced unintended bias into their experiments, and they didn't test any commercially deployed systems like Amazon's Rekognition, Google Cloud's Vision API, or Microsoft Azure's Face API. Nonetheless, they advocate for reducing the facial expression bias in future face recognition databases and for further developing bias-reduction methods applicable to existing databases and to models already trained on problematic datasets.

"The lack of diversity in facial expressions in face databases intended for development and evaluation of face recognition systems represents, among other disadvantages, a security vulnerability of the resulting systems," the coauthors wrote. "Small changes in facial expression can easily mislead the face recognition systems developed around those biased databases. Facial expressions have an impact on the matching scores computed by a face recognition system. This effect can be exploited as a possible vulnerability, reducing the probabilities to be matched."
