The computer vision APIs offered by Google, Microsoft, and IBM exhibit gender bias when tested on self-portraits of people wearing partial face masks. That’s according to data scientists at marketing communications agency Wunderman Thompson, who found that popular computer vision services like Cloud Vision API and Azure Cognitive Services Computer Vision more often misidentify the kinds of masks worn during the pandemic as “duct tape” and “fashion accessories” on women as opposed to “beards” and “facial hair” on men.
Ilinca Barsan, director of data science at Wunderman Thompson, wasn’t looking for bias in commercial computer vision APIs. She had intended to build a tool that would allow users to connect to thousands of street cameras around the country and determine the proportion of pedestrians wearing masks at any given time. Google’s Cloud Vision API was supposed to power the tool’s mask detection component, providing labels for elements of images, along with confidence scores associated with those labels.
When Barsan uploaded a photo of herself wearing a mask to test Cloud Vision API’s accuracy, she noticed one unexpected label, “duct tape,” surfaced to the top with 96.57% confidence. (A high confidence score indicates the model believes the label is highly relevant to the image.) A photo of her in a ruby-colored mask returned 87% confidence for “duct tape” and dropped the “mask” label, which had scored 73.92% on the first photo, from the list of labels. A blue surgical mask yielded “duct tape” once again (with a 66% confidence score) and failed to elicit the “mask” label for the second time.
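To make the idea of labels and confidence scores concrete, here is a minimal Python sketch of how a label list like the one Barsan saw could be inspected. It does not call Cloud Vision API. The `Label` type and the helpers `top_label` and `score_for` are hypothetical names introduced for illustration, and the only data used are the two scores reported for the first photo.

```python
from typing import NamedTuple, Optional, Sequence


class Label(NamedTuple):
    """Hypothetical stand-in for one label returned by a vision service."""
    description: str
    score: float  # confidence between 0.0 and 1.0


def top_label(labels: Sequence[Label]) -> Label:
    """Return the label with the highest confidence score."""
    return max(labels, key=lambda label: label.score)


def score_for(labels: Sequence[Label], name: str) -> Optional[float]:
    """Return the score for a named label, or None if it was not returned."""
    for label in labels:
        if label.description.lower() == name.lower():
            return label.score
    return None


# The two scores the article reports for the first photo.
first_photo = [Label("Duct tape", 0.9657), Label("Mask", 0.7392)]

print(top_label(first_photo).description)
print(score_for(first_photo, "mask"))
```

A label that is missing from the list, as “mask” was for the ruby-colored and surgical masks, would come back as `None` from `score_for` rather than as a low score.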
Barsan took this as a sign of bias within the computer vision models underlying Cloud Vision API. She theorized they might be drawing on sexist portrayals of women in the data set on which they were trained — women who had perhaps been victims of violence.
It’s not an unreasonable assumption. Back in 2015, a software engineer pointed out that the image recognition algorithms in Google Photos were labeling his Black friends as “gorillas.” A University of Washington study found women were significantly underrepresented in Google Image searches for professions like “CEO.” More recently, nonprofit AlgorithmWatch showed Cloud Vision API automatically labeled a thermometer held by a dark-skinned person as a “gun” while labeling a thermometer held by a light-skinned person as an “electronic device.”
In response, Google says it adjusted the confidence scores to more accurately reflect when a firearm is in a photo. The company also removed the ability to label people in images as “man” or “woman” with Cloud Vision API because errors had violated Google’s AI principle of not creating biased systems.
To test whether Cloud Vision API might classify appearances differently for mask-wearing men versus mask-wearing women, Barsan and team solicited mask images from friends and colleagues, which they added to a data set of photos found on the web. The final corpus consisted of 265 images of men in masks and 265 images of women in masks in varying contexts, from outdoor pictures and office snapshots with DIY cotton masks to stock images and iPhone selfies showing N95 respirators.
According to Barsan, out of the 265 images of men in masks, Cloud Vision API correctly identified 36% as containing personal protective equipment (PPE) and seemed to make the association that something covering a man’s face was likely to be facial hair (27% of the images had the label “facial hair”). Around 15% of images were misclassified as “duct tape” with a 92% average confidence score, suggesting it might be an issue for both men and women. But out of the 265 images of women in masks, Cloud Vision API mistook 28% as depicting duct tape with an average confidence score of 93%. It returned “PPE” 19% of the time and “facial hair” 8% of the time.
“At almost twice the number for men, ‘duct tape’ was the single most common ‘bad guess’ for labeling masks [for women],” Barsan said. “The model certainly made an educated guess. Which begged the question — where exactly did it go to school?”
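The “almost twice” figure can be checked with simple arithmetic on the Cloud Vision API percentages reported above. The short Python sketch below is only an illustration: the image counts it derives are approximations, since the article gives rounded percentages rather than raw counts, and the variable and function names are hypothetical.

```python
IMAGES_PER_GROUP = 265  # images of men and of women in the test set

# Share of images receiving each Cloud Vision API label, as reported.
men = {"PPE": 0.36, "facial hair": 0.27, "duct tape": 0.15}
women = {"PPE": 0.19, "facial hair": 0.08, "duct tape": 0.28}


def approx_count(share: float) -> int:
    """Approximate number of images behind a rounded percentage."""
    return round(share * IMAGES_PER_GROUP)


# Ratio of the women's rate to the men's rate for each label.
ratio = {label: women[label] / men[label] for label in men}

for label in men:
    print(
        f"{label}: men ~{approx_count(men[label])} images, "
        f"women ~{approx_count(women[label])} images, "
        f"women/men ratio {ratio[label]:.2f}"
    )
```

Under these assumptions, “duct tape” appears on roughly 74 images of women versus roughly 40 images of men, a ratio of about 1.87.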
In a statement provided to VentureBeat, Cloud AI director of product strategy Tracy Frey said Google has reached out to Wunderman directly to learn more about the team’s research, methodology, and findings. “Fairness is one of our core AI principles, and we’re committed to making progress in this area. We’ve been working on the challenge of accurately detecting objects for several years, and will continue to do so,” Frey said. “In the last year, we’ve developed tools and data sets to help identify and reduce bias in machine learning models, and we offer these as open source for the larger community so their feedback can help us improve.”
Google isn’t the only vendor with apparent bias in its computer vision models. After testing Cloud Vision API, Barsan and team ran the same data set through IBM’s Watson Visual Recognition service, which returned the label “restraint chains” for 23% of the images of masked women (compared with 10% of the images of men) and “gag” for 23% (compared with 10% of the male images). Furthermore, Watson correctly identified 12% of the men to be wearing masks, while it was only right 5% of the time for the women.
The average confidence score for the “gag” label hovered around 79% for women compared to 75% for men, suggesting that Watson Visual Recognition was more hesitant than Cloud Vision API to assign those labels. IBM declined to comment, but it took issue with the way the data set was compiled, and a spokesperson says the company is conducting tests to find evidence of the bias Barsan claims to have uncovered.
In a final experiment, Barsan and colleagues tested Microsoft’s Azure Cognitive Services Computer Vision API, which two years ago received an update ostensibly improving its ability to recognize gender across different skin tones. The service struggled to correctly tag masks in pictures, correctly labeling only 9% of images of men and 5% of images of women as featuring a mask. And while it didn’t return labels like “duct tape,” “gag,” or “restraint,” Azure Cognitive Services identified masks as “fashion accessories” for 40% of images of women (versus only 13% of images of men), as “lipstick” for 14% of images of women, and as a beard for 12% of images of men.
Microsoft also declined to comment.
“In terms of research contribution or anything like that, it’s sort of repeating a point that’s been said,” Mike Cook, an AI researcher with a fellowship at Queen Mary University of London, told VentureBeat. “But it’s an interesting point … It made me think a lot about the myth of the ‘good’ data set. Honestly, I feel like some things just cannot hope to have data sets built around them without being hopelessly narrow or biased. It’s all very well to remove the ‘man’ label from a data set, but are there any photos of women with facial hair in that data set, or men with lipstick on? Probably not, because the data set reflects certain norms and expectations that are always aging and becoming less relevant.”
Barsan doesn’t believe the results to be indicative of malicious intent on the part of Google, IBM, or Microsoft, but she says this is yet another example of the prejudices that can emerge in unbalanced data sets and machine learning models. They have the potential to perpetuate harmful stereotypes, she says, reflecting a culture in which violence against women is often normalized and exploited.
“A simple image search of ‘duct tape man’ and ‘duct tape woman’ respectively revealed images of men mostly (though not exclusively) pictured in full-body duct tape partaking in funny pranks, while women predominantly appeared with their mouth duct-taped, many clearly in distress,” Barsan noted. When it came to masks, “Across the board, all three computer vision models performed poorly at the task at hand. However, they were consistently better at identifying masked men than women.”
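The claim that all three services did better on masked men can be read straight off the figures quoted earlier. As a rough sketch, the Python below collects the reported correct-identification rates and computes the gap for each service. The dictionary and variable names are hypothetical, and the Cloud Vision API entry uses the “PPE” label rate as its measure of a correct identification.

```python
# (men, women) share of images correctly identified as showing a mask,
# as reported in the article. For Cloud Vision API this is the "PPE" label.
correct_rate = {
    "Cloud Vision API": (0.36, 0.19),
    "Watson Visual Recognition": (0.12, 0.05),
    "Azure Computer Vision": (0.09, 0.05),
}

# Gap between the men's and women's rates for each service.
gaps = {
    name: round(men - women, 2)
    for name, (men, women) in correct_rate.items()
}

# True only if every service scored higher on images of men.
men_always_higher = all(men > women for men, women in correct_rate.values())

for name, gap in gaps.items():
    print(f"{name}: {gap * 100:.0f} percentage points higher for men")
print("Higher for men in every service:", men_always_higher)
```

None of the services exceeds 36% on either group, which matches Barsan's point that performance was poor overall as well as uneven.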
That’s certainly not surprising in the context of computer vision, which countless studies have shown to be susceptible to bias. A study last fall by University of Colorado, Boulder researchers showed that AI from Amazon, Clarifai, Microsoft, and others maintained accuracy rates above 95% for cisgender men and women but misidentified trans men as women 38% of the time. Separate benchmarks of major vendors’ systems by the Gender Shades project and the National Institute of Standards and Technology (NIST) suggest that facial recognition technology exhibits racial and gender bias and facial recognition programs can be wildly inaccurate, misclassifying people upwards of 96% of the time.
“Beyond damage control and Band-Aid solutions, we must work diligently to ensure that the artificial intelligences we build have the full benefit of our own natural intelligence,” Barsan said. “If our machines are to work accurately and to responsibly reflect society, we must help them understand the social dynamics that we live in, to stop them from reinforcing existing inequalities through automation, and put them to work for good instead … After all, we’d quite like our street-cam analyzer to suggest that 56% of people on the street are staying safe — not being gagged and restrained.”
Via email, Barsan later clarified that the street-cam analyzer project was “an internal hypothetical exercise” to provide feedback to people in high-risk categories regarding how safe it might be to go to public places. Out of concern over privacy implications and in light of the bias research she ended up conducting, Barsan decided against pursuing it further.