Studies find bias in AI models that recommend treatments and diagnose diseases

Research into AI and machine learning methods for health care suggests that they hold promise in the areas of phenotype classification, mortality and length-of-stay prediction, and intervention recommendation. But models have traditionally been treated as black boxes, in the sense that the rationale behind their suggestions isn't explained or justified. This lack of interpretability, together with bias in the training datasets, threatens to hinder the effectiveness of these technologies in critical care.

Two studies published this week underline the challenges yet to be overcome when applying AI to point-of-care settings. In the first, researchers at the University of Southern California in Los Angeles evaluated the fairness of models trained with Medical Information Mart for Intensive Care IV (MIMIC-IV), the largest publicly available medical records dataset. The other, which was coauthored by scientists at Queen Mary University of London, explores the technical barriers to training unbiased health care models. Both arrive at the conclusion that ostensibly "fair" models designed to diagnose illnesses and recommend treatments are susceptible to unintended and undesirable racial and gender prejudices.

As the University of Southern California researchers note, MIMIC-IV contains the de-identified data of 383,220 patients admitted to an intensive care unit (ICU) or the emergency department at Beth Israel Deaconess Medical Center in Boston, Massachusetts between 2008 and 2019. The coauthors focused on a subset of 43,005 ICU stays, filtering out patients younger than 15 years old who hadn't visited the ICU more than once or who stayed less than 24 hours. Represented among the samples were married or single male and female Asian, Black, Hispanic, and white hospital patients with Medicaid, Medicare, or private insurance.

In one of several experiments to determine to what extent bias might exist in the MIMIC-IV subset, the researchers trained a model to recommend one of five categories of mechanical ventilation. Alarmingly, they found that the model's suggestions varied across different ethnic groups. Black and Hispanic cohorts were less likely to receive ventilation treatments, on average, while also receiving a shorter treatment duration.

Insurance status also appeared to have played a role in the ventilator treatment model's decision-making, according to the researchers. Privately insured patients tended to receive longer and more ventilation treatments compared with Medicare and Medicaid patients, presumably because patients with generous insurance could afford better treatment.
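A comparison of this kind, checking how often a model recommends treatment for each group, can be sketched in a few lines of Python. This is only an illustration, not the researchers' code: the records, the field names (`insurance`, `recommended`), and the helper functions are hypothetical, and the numbers are made up rather than drawn from MIMIC-IV.

```python
from collections import defaultdict

def recommendation_rates(records, group_key):
    """Fraction of patients in each group for whom treatment was recommended."""
    counts = defaultdict(lambda: [0, 0])  # group -> [recommended, total]
    for rec in records:
        tally = counts[rec[group_key]]
        if rec["recommended"]:
            tally[0] += 1
        tally[1] += 1
    return {group: rec_n / total for group, (rec_n, total) in counts.items()}

def parity_gap(rates):
    """Difference between the highest and lowest group rate (0 means parity)."""
    return max(rates.values()) - min(rates.values())

# Hypothetical toy records, NOT real patient data.
records = [
    {"insurance": "private", "recommended": True},
    {"insurance": "private", "recommended": True},
    {"insurance": "private", "recommended": False},
    {"insurance": "private", "recommended": True},
    {"insurance": "medicaid", "recommended": True},
    {"insurance": "medicaid", "recommended": False},
    {"insurance": "medicaid", "recommended": False},
    {"insurance": "medicaid", "recommended": False},
]

rates = recommendation_rates(records, "insurance")
gap = parity_gap(rates)
print(rates, gap)
```

A large gap on its own does not prove the model is unfair, since groups can differ in clinical need, but it flags where a closer look is warranted.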

The researchers caution that there exist "multiple confounders" in MIMIC-IV that might have led to the bias in ventilator predictions. However, they point to this as motivation for a closer look at models in health care and the datasets used to train them.

In the study published by Queen Mary University researchers, the focus was on the fairness of medical image classification. Using CheXpert, a benchmark dataset for chest X-ray analysis comprising 224,316 annotated radiographs, the coauthors trained a model to predict one of five pathologies from a single image. They then looked for imbalances in the predictions the model gave for male versus female patients.
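One common way to look for such imbalances is to compare the true positive rate, the share of actual disease cases the model catches, between groups. The sketch below assumes binary labels and predictions for a single pathology; the function names and the toy data are hypothetical, and the study may have used different metrics.

```python
def true_positive_rate(labels, preds):
    """Share of positive cases (label 1) that the model also predicts as 1."""
    hits = [p for y, p in zip(labels, preds) if y == 1]
    if not hits:
        return None  # no positive cases in this group
    return sum(hits) / len(hits)

def tpr_by_group(labels, preds, groups):
    """True positive rate computed separately for each group value."""
    result = {}
    for group in sorted(set(groups)):
        idx = [i for i, g in enumerate(groups) if g == group]
        result[group] = true_positive_rate(
            [labels[i] for i in idx], [preds[i] for i in idx]
        )
    return result

# Hypothetical toy data: 1 = pathology present / predicted, 0 = absent.
labels = [1, 1, 1, 1, 0, 1, 1, 1, 1, 0]
preds = [1, 1, 1, 0, 0, 1, 0, 0, 1, 1]
groups = ["M"] * 5 + ["F"] * 5

tprs = tpr_by_group(labels, preds, groups)
tpr_gap = abs(tprs["M"] - tprs["F"])
print(tprs, tpr_gap)
```

In this made-up example the model catches a smaller share of true cases among female patients than among male patients, which is the kind of disparity a fairness analysis is meant to surface.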

Prior to training the model, the researchers implemented three types of "regularizers" intended to reduce bias. Overall, this had the opposite of the intended effect: when trained with the regularizers, the model was even less fair than when trained without them. The researchers do note that one regularizer, an "equal loss" regularizer, achieved better parity between males and females, though this parity came at the cost of increased disparity in predictions among age groups.
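The article does not spell out how the "equal loss" regularizer is defined, so the following is only a plausible sketch of the general idea: add a penalty to the training objective that grows as the average loss differs between groups. The binary cross-entropy loss, the `weight` parameter, and the function names are all assumptions made for illustration, not the study's formulation.

```python
import math

def bce(label, prob, eps=1e-12):
    """Binary cross-entropy for one example, with clipping for stability."""
    prob = min(max(prob, eps), 1 - eps)
    return -(label * math.log(prob) + (1 - label) * math.log(1 - prob))

def mean(values):
    return sum(values) / len(values)

def equal_loss_objective(labels, probs, groups, weight=1.0):
    """Average loss plus a penalty on the gap between per-group mean losses.

    Assumed form: penalty = max group mean loss - min group mean loss.
    Returns (total objective, penalty).
    """
    losses = [bce(y, p) for y, p in zip(labels, probs)]
    by_group = {}
    for loss, group in zip(losses, groups):
        by_group.setdefault(group, []).append(loss)
    group_means = [mean(v) for v in by_group.values()]
    penalty = max(group_means) - min(group_means)
    return mean(losses) + weight * penalty, penalty

# Hypothetical toy batch: the model is less confident on group "F".
labels = [1, 0, 1, 0]
probs = [0.9, 0.1, 0.6, 0.4]
groups = ["M", "M", "F", "F"]

objective, penalty = equal_loss_objective(labels, probs, groups, weight=1.0)
print(objective, penalty)
```

Because the penalty in this sketch only sees the attribute it is given (here, sex), nothing in it constrains gaps along other attributes such as age, which is consistent with the trade-off the researchers describe.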

"Models can easily overfit the training data and thus give a false sense of fairness during training which does not generalize to the test set," the researchers wrote. "Our results outline some of the limitations of current train time interventions for fairness in deep learning."

The two studies build on previous research showing pervasive bias in predictive health care models. Because of a reluctance to release code, datasets, and techniques, much of the data used to train algorithms for diagnosing and treating diseases is hard to scrutinize and might perpetuate inequalities.

Recently, a team of U.K. scientists found that almost all eye disease datasets come from patients in North America, Europe, and China, meaning eye disease-diagnosing algorithms are less certain to work well for racial groups from underrepresented countries. In another study, Stanford University researchers claimed that most of the U.S. data for studies involving medical uses of AI come from California, New York, and Massachusetts. A study of a UnitedHealth Group algorithm determined that it could underestimate by half the number of Black patients in need of greater care. Researchers from the University of Toronto, the Vector Institute, and MIT showed that widely used chest X-ray datasets encode racial, gender, and socioeconomic bias. And a growing body of work suggests that skin cancer-detecting algorithms tend to be less precise when used on Black patients, in part because AI models are trained mostly on images of light-skinned patients.

Bias isn't an easy problem to solve, but the coauthors of one recent study recommend that health care practitioners apply "rigorous" fairness analyses prior to deployment as one solution. They also suggest that clear disclaimers about the dataset collection process and the potential resulting bias could improve assessments for clinical use.
