Researchers find evidence of racial, gender, and socioeconomic bias in chest X-ray classifiers

Google and startups like Qure.ai, Aidoc, and DarwinAI are developing AI and machine learning systems that classify chest X-rays to help identify conditions like fractures and collapsed lungs. Several hospitals, including Mount Sinai, have piloted computer vision algorithms that analyze scans from patients with the novel coronavirus. But research from the University of Toronto, the Vector Institute, and MIT reveals that chest X-ray datasets used to train diagnostic models exhibit imbalance, biasing them against certain gender, socioeconomic, and racial groups.

Partly due to a reticence to release code, datasets, and techniques, much of the data used today to train AI algorithms for diagnosing diseases may perpetuate inequalities. A team of U.K. scientists found that almost all eye disease datasets come from patients in North America, Europe, and China, meaning eye disease-diagnosing algorithms are less certain to work well for racial groups from underrepresented countries. In another study, Stanford University researchers claimed that most of the U.S. data for studies involving medical uses of AI come from California, New York, and Massachusetts. A study of a UnitedHealth Group algorithm determined that it could underestimate by half the number of Black patients in need of greater care. And a growing body of work suggests that skin cancer-detecting algorithms tend to be less precise when used on Black patients, in part because AI models are trained mostly on images of light-skinned patients.

The coauthors of this newest paper sought to determine whether state-of-the-art AI classifiers trained on public medical imaging datasets were fair across different patient subgroups. They specifically looked at MIMIC-CXR (which contains over 370,000 images), Stanford's CheXpert (over 223,000 images), the U.S. National Institutes of Health's Chest-Xray (over 112,000 images), and an aggregate of all three. Together, the datasets comprise scans from over 129,000 patients, each labeled with the patient's sex and age range. MIMIC-CXR also has race and insurance type data; for all but 100,000 of its images, the dataset specifies whether patients are Asian, Black, Hispanic, white, Native American, or other and whether they're on Medicare, Medicaid, or private insurance.

The researchers first trained the classifiers on the datasets and confirmed that they reached near-state-of-the-art classification performance, which ruled out the possibility that any disparities simply reflected poor overall performance. They then measured disparities across the labels, datasets, and attributes. They found that all four datasets contained "meaningful" patterns of bias and imbalance, with female patients suffering from the highest disparity even though the proportion of women was only slightly lower than that of men. White patients -- the majority, with 67.6% of all the X-ray images -- were the most-favored subgroup, while Hispanic patients were the least-favored. And bias existed against patients with Medicaid insurance, the minority population with only 8.98% of X-ray images. The classifiers often provided Medicaid patients with incorrect diagnoses.
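
This article does not spell out how the disparities were computed, so the following is only a minimal sketch of one common way to quantify a subgroup disparity: compare the true positive rate (the share of actual positive cases a classifier catches) across groups and report the gap between the best- and worst-served group. The function names, the group labels "A" and "B", and the toy labels and predictions are all hypothetical and are not taken from the study.

```python
from collections import defaultdict


def true_positive_rate_by_group(y_true, y_pred, groups):
    """Return {group: TPR} for binary ground-truth labels and predictions."""
    positives = defaultdict(int)  # actual positive cases seen per group
    hits = defaultdict(int)       # positives the classifier also flagged
    for truth, pred, group in zip(y_true, y_pred, groups):
        if truth == 1:
            positives[group] += 1
            if pred == 1:
                hits[group] += 1
    # Groups with no positive cases are omitted because TPR is undefined there.
    return {g: hits[g] / positives[g] for g in positives}


def tpr_gap(rates):
    """Difference between the highest and lowest per-group TPR."""
    return max(rates.values()) - min(rates.values())


# Hypothetical toy data: 1 = finding present, 0 = finding absent.
y_true = [1, 1, 1, 1, 0, 1, 1, 1, 1, 0]
y_pred = [1, 1, 1, 0, 0, 1, 0, 0, 1, 1]
groups = ["A"] * 5 + ["B"] * 5

rates = true_positive_rate_by_group(y_true, y_pred, groups)
gap = tpr_gap(rates)

for group, rate in sorted(rates.items()):
    print(f"group {group}: TPR = {rate:.2f}")
print(f"TPR gap = {gap:.2f}")
```

In this toy example the classifier catches three of four positive cases in group A but only two of four in group B, so group B is missed more often even though both groups are the same size. A real analysis would repeat the comparison for each diagnostic label and each attribute (sex, age, race, insurance type).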

The researchers note that their study has limitations arising from the nature of the labels in the datasets. Each label was extracted from radiology reports using natural language processing techniques, meaning a portion of them could have been erroneous. The coauthors also concede that the quality of the imaging devices themselves, the region of the data collection, and the patient demographics at each collection site might have confounded the results.

However, they assert that even the implication of bias is enough to warrant a closer look at the datasets and any models trained on them. "Subgroups with chronic underdiagnosis are those who experience more negative social determinants of health, specifically, women, minorities, and those of low socioeconomic status. Such patients may use healthcare services less than others," the researchers wrote. "There are a number of reasons why datasets may induce disparities in algorithms, from imbalanced datasets to differences in statistical noise in each group to differences in access to healthcare for patients of different groups ... Although 'de-biasing' techniques may reduce disparities, we should not ignore the important biases inherent in existent large public datasets."

Beyond basic dataset challenges, classifiers lacking sufficient peer review can encounter unforeseen roadblocks when deployed in the real world. Scientists at Harvard found that algorithms trained to recognize and classify CT scans could become biased toward scan formats from certain CT machine manufacturers. Meanwhile, a Google-published whitepaper revealed challenges in implementing an eye disease-predicting system in hospitals in Thailand, including issues with scan accuracy. And studies conducted by companies like Babylon Health, a well-funded telemedicine startup that claims to be able to triage a range of diseases from text messages, have been repeatedly called into question.

The researchers of this study recommend that practitioners apply "rigorous" fairness analyses before deployment as one solution to bias. They also suggest that clear disclaimers about the dataset collection process and the potential resulting algorithmic bias could improve assessments for clinical use.
