Study suggests that AI model selection might introduce bias

The past several years have established that AI and machine learning are not a panacea when it comes to fair outcomes. Applying algorithmic solutions to social problems can magnify biases against marginalized peoples, and undersampling populations can result in worse predictive accuracy for those populations. But bias in AI doesn't arise from datasets alone. Problem formulation, or the way researchers fit tasks to AI techniques, can also contribute. So can other human-led steps throughout the AI deployment pipeline.

To this end, a new study coauthored by researchers at Cornell and Brown University investigates the problems around model selection -- the process by which engineers choose machine learning models to deploy after training and validation. The team found that model selection presents another opportunity to introduce bias because the metrics used to distinguish between models are subject to interpretation and judgement.

In machine learning, a model is typically trained on a dataset and evaluated on a metric (e.g., accuracy) using a separate test dataset. To improve performance, the learning process can be repeated. Retraining until a satisfactory model is produced is what's known as a "researcher degree of freedom."
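To see why repeated retraining matters, consider a toy simulation. It is not the study's method, and every number in it is an assumption: each "retraining" is modeled as a model with the same true accuracy (90%) scored on a fresh 200-example test set, so any spread in the scores comes from chance alone. The function names are hypothetical.

```python
import random


def simulated_test_accuracy(true_accuracy, test_size, rng):
    """Score one hypothetical model: each test example is correct with
    probability `true_accuracy`."""
    correct = sum(rng.random() < true_accuracy for _ in range(test_size))
    return correct / test_size


def best_of_n(true_accuracy, test_size, n_runs, seed=0):
    """Simulate `n_runs` retrainings and return (best score, all scores)."""
    rng = random.Random(seed)
    scores = [
        simulated_test_accuracy(true_accuracy, test_size, rng)
        for _ in range(n_runs)
    ]
    return max(scores), scores


# Assumed values: 90% true accuracy, 200 test examples.
single, _ = best_of_n(0.90, 200, n_runs=1)
best, scores = best_of_n(0.90, 200, n_runs=20)

print(f"one run:        {single:.3f}")
print(f"best of 20:     {best:.3f}")
print(f"range of runs:  {min(scores):.3f} to {max(scores):.3f}")
```

Because both calls use the same seed, the first of the 20 runs is identical to the single run, so the best-of-20 figure can only match or exceed it. Reporting only that best score would hide the run-to-run spread.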

While researchers may report average performance across a small number of models, they often publish results using a specific set of variables that can obscure a model's true performance. This presents a challenge because other model properties can change during training. Seemingly minute differences in accuracy between groups can add up to many affected people when a model is applied to large populations, impacting fairness with regard to specific demographics.
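A quick back-of-the-envelope calculation shows how a small accuracy gap scales. The figures here are invented for illustration and do not come from the study: a gap of half a percentage point between two groups, applied to a group of one million people.

```python
def extra_errors(accuracy_gap, group_size):
    """Additional misclassified people in the lower-accuracy group,
    given a gap expressed as a fraction (0.005 = 0.5 percentage points)."""
    return round(accuracy_gap * group_size)


# Hypothetical numbers: a 0.5-point gap across 1,000,000 people.
affected = extra_errors(0.005, 1_000_000)
print(affected)  # 5000
```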

The coauthors highlight a case study in which test subjects were asked to choose a "fair" skin cancer detection model based on metrics they identified. Overwhelmingly, the subjects selected the model with the highest accuracy even though it exhibited the largest gender disparity. This is problematic on its face, the researchers say, because the accuracy metric doesn't provide a breakdown of false negatives (missing a cancer diagnosis) and false positives (mistakenly diagnosing cancer when it's not actually present). Including these metrics could have led the subjects to make different choices concerning which model was "best."
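The gap between overall accuracy and per-group error rates can be made concrete with a small sketch. The counts below are made up for illustration and are not the case study's data, and the helper names are hypothetical. Model A scores higher on overall accuracy, but it misses most cancers in one group, which only shows up once false negatives are broken out by group.

```python
from collections import defaultdict


def group_metrics(y_true, y_pred, groups):
    """Return accuracy, false negative rate and false positive rate per group."""
    counts = defaultdict(lambda: {"tp": 0, "fn": 0, "fp": 0, "tn": 0})
    for truth, pred, group in zip(y_true, y_pred, groups):
        if truth == 1:
            key = "tp" if pred == 1 else "fn"  # cancer present
        else:
            key = "fp" if pred == 1 else "tn"  # cancer absent
        counts[group][key] += 1

    metrics = {}
    for group, c in counts.items():
        positives = c["tp"] + c["fn"]
        negatives = c["fp"] + c["tn"]
        metrics[group] = {
            "accuracy": (c["tp"] + c["tn"]) / (positives + negatives),
            "fnr": c["fn"] / positives if positives else float("nan"),
            "fpr": c["fp"] / negatives if negatives else float("nan"),
        }
    return metrics


def expand(counts_by_group):
    """Turn per-group (tp, fn, fp, tn) counts into parallel label lists."""
    y_true, y_pred, groups = [], [], []
    for group, (tp, fn, fp, tn) in counts_by_group.items():
        for truth, pred, n in ((1, 1, tp), (1, 0, fn), (0, 1, fp), (0, 0, tn)):
            y_true += [truth] * n
            y_pred += [pred] * n
            groups += [group] * n
    return y_true, y_pred, groups


def overall_accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Invented counts: 20 patients per group, 4 of whom have cancer.
model_a = expand({"women": (1, 3, 0, 16), "men": (4, 0, 0, 16)})
model_b = expand({"women": (3, 1, 2, 14), "men": (3, 1, 2, 14)})

acc_a = overall_accuracy(model_a[0], model_a[1])
acc_b = overall_accuracy(model_b[0], model_b[1])
metrics_a = group_metrics(*model_a)
metrics_b = group_metrics(*model_b)

print(f"Model A overall accuracy: {acc_a:.3f}")
print(f"Model B overall accuracy: {acc_b:.3f}")
for name, metrics in (("A", metrics_a), ("B", metrics_b)):
    for group, m in metrics.items():
        print(f"Model {name}, {group}: FNR={m['fnr']:.2f}, FPR={m['fpr']:.3f}")
```

With these invented counts, someone choosing on accuracy alone would pick Model A, while someone shown the false negative rates by group might well prefer Model B.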

"The overarching point is that contextual information is highly important for model selection, particularly with regard to which metrics we choose to inform the selection decision," the coauthors of the study wrote. "Moreover, sub-population performance variability, where the sub-populations are split on protected attributes, can be a crucial part of that context, which in turn has implications for fairness."

Beyond model selection and problem formulation, research is beginning to shed light on the various ways humans might contribute to bias in models. For example, researchers at MIT found just over 2,900 errors arising from labeling mistakes in ImageNet, an image database used to train countless computer vision algorithms. A separate Columbia study concluded that biased algorithmic predictions are mostly caused by imbalanced data but that the demographics of engineers also play a role, with models created by less diverse teams generally faring worse.

In future work, the Cornell and Brown University team says it intends to see whether the issue of performance variability can be ameliorated through "AutoML" methods, which remove human choice from the model selection process. But the research suggests new approaches might be needed to mitigate every human-originated source of bias.