As artificial intelligence becomes more advanced and capable of tackling complex tasks, we are increasingly trusting AI algorithms to make critical decisions. But how fair are those decisions? Experience in the past few years shows AI algorithms can manifest gender and racial bias, raising concern over their use in critical domains, such as deciding whose loan gets approved, who’s qualified for a job, who gets to walk free and who stays in prison.

New research by scientists at Boston University shows just how hard it is to evaluate fairness in AI algorithms and tries to establish a framework for detecting and mitigating problematic behavior in automated decisions. Titled “From Soft Classifiers to Hard Decisions: How fair can we be?,” the research paper is being presented this week at the Association for Computing Machinery conference on Fairness, Accountability, and Transparency (ACM FAT*).

The BU team’s work builds on efforts in recent years to document and mitigate algorithmic bias. Most efforts in the space focus on examining how automated systems affect different groups of people and whether they treat those groups equally. One challenge in this field is that the rules and metrics for evaluating AI fairness are not clear-cut.

“One of the things that caught our attention early on was the fact that in many settings there are different notions of fairness that all seem reasonable but fundamentally incompatible. You just cannot have a system that satisfies all of them,” Adam Smith, Professor of Computer Science and Engineering at Boston University and co-author of the paper, told me.

For their research, the BU scientists used data released from a famous 2016 ProPublica investigation into COMPAS, an automated recidivism assessment tool. ProPublica concluded that COMPAS showed bias against African American defendants, associating them with higher risk scores and giving them harsher prison sentences.

Most automated decision-making systems are composed of two components. The first is a calibrated scoring classifier that assigns a number between zero and one to each specific case. The system usually uses machine learning to analyze different data points and to provide an output that best matches the patterns it has previously seen. In the case of recidivism systems, the output of the scoring system is the probability that a defendant might commit crimes if released from jail.

The second component is a binary post-processing system that transforms the risk score into a yes/no output. In recidivism, the post-processing component decides whether the defendant stays in jail or walks free based on thresholds and ranges. For instance, if a risk score is above 0.5, the defendant is considered high-risk and their jail sentence will be extended. Experience shows that using the same thresholds for all populations can result in unfair decisions.
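To make the two-stage pipeline concrete, here is a minimal Python sketch of the post-processing step described above. The function name `decide` and the sample scores are made up for illustration, and the 0.5 cutoff is simply the example value from the paragraph above, not a setting taken from COMPAS or from the paper.

```python
def decide(risk_score: float, threshold: float = 0.5) -> bool:
    """Turn a risk score in [0, 1] into a yes/no output.

    Returns True when the case is treated as high-risk.
    """
    if not 0.0 <= risk_score <= 1.0:
        raise ValueError("risk score must lie between 0 and 1")
    return risk_score > threshold


# Hypothetical scores produced by the first-stage classifier.
scores = [0.12, 0.50, 0.51, 0.93]
decisions = [decide(s) for s in scores]
print(decisions)
```

Note that the same threshold is applied to every case here, which is exactly the setup the researchers found can lead to unfair decisions.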

“We were curious to know whether, if risk scores are somehow behaving similarly on two populations, can they be used to make decisions that are fair or unbiased? The short answer is no. Even if your risk score satisfies these notions of calibration, it still may be impossible to make decisions based on that score that satisfy various equalized error criteria,” Smith said.
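A toy calculation can illustrate the point Smith makes. In the Python sketch below, the two groups and all of their numbers are invented for illustration and have nothing to do with the COMPAS data. Each score is calibrated within each group (a score of 4/5 means 80 percent of those people reoffend), yet a single threshold of 1/2 produces different error rates for the two groups.

```python
from fractions import Fraction

# Each bucket is (risk score, number of people, number who reoffend).
# All figures are made up for illustration.
group_a = [(Fraction(1, 5), 100, 20), (Fraction(4, 5), 100, 80)]
group_b = [(Fraction(2, 5), 100, 40), (Fraction(3, 5), 100, 60)]


def is_calibrated(buckets):
    """A score is calibrated if it equals the observed reoffense rate."""
    return all(Fraction(reoffend, people) == score
               for score, people, reoffend in buckets)


def error_rates(buckets, threshold):
    """Return (false positive rate, false negative rate) for one threshold."""
    fp = fn = negatives = positives = 0
    for score, people, reoffend in buckets:
        no_reoffend = people - reoffend
        positives += reoffend
        negatives += no_reoffend
        if score > threshold:
            fp += no_reoffend   # flagged high-risk but did not reoffend
        else:
            fn += reoffend      # flagged low-risk but did reoffend
    return Fraction(fp, negatives), Fraction(fn, positives)


threshold = Fraction(1, 2)
rates_a = error_rates(group_a, threshold)
rates_b = error_rates(group_b, threshold)
print(is_calibrated(group_a), is_calibrated(group_b))
print("group A (FPR, FNR):", rates_a)
print("group B (FPR, FNR):", rates_b)
```

In this made-up example, group A ends up with a false positive rate of 1/5 while group B ends up with 2/5, even though the scores are equally well calibrated for both.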

The BU researchers developed different approaches to evaluating and processing risk scores that can help reduce false positives and false negatives and show more fairness toward different groups of people.

The first part of their technique involves configuring the decision-maker and classifier separately for different populations. For instance, if a specific demographic is more vulnerable to biased decisions because of historical injustices that find their way into the training of the classifier, the system’s threshold is adjusted to reduce the risks for that specific population. The paper includes examples showing that using different decision rules for different groups produces better outcomes in terms of equalizing error rates.
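The idea of group-specific decision rules can be sketched in a few lines of Python. This is not the procedure from the paper: the records are invented, the thresholds are hand-picked, and the function `false_positive_rate` is a hypothetical helper written for this example. It only shows how moving one group’s threshold can bring the false positive rates of two groups in line.

```python
# Each record is (risk score, outcome), where outcome 1 means the person
# reoffended and 0 means they did not. All records are made up.
records = {
    "a": [(0.2, 0), (0.3, 0), (0.4, 0), (0.6, 0), (0.7, 1), (0.8, 1)],
    "b": [(0.3, 0), (0.4, 0), (0.55, 0), (0.65, 0), (0.7, 1), (0.9, 1)],
}


def false_positive_rate(group_records, threshold):
    """Share of non-reoffenders whose score exceeds the threshold."""
    negatives = [score for score, outcome in group_records if outcome == 0]
    flagged = [score for score in negatives if score > threshold]
    return len(flagged) / len(negatives)


# One threshold for everyone.
shared = {g: false_positive_rate(r, 0.5) for g, r in records.items()}

# Hand-picked, group-specific thresholds (an assumption for illustration).
thresholds = {"a": 0.5, "b": 0.6}
per_group = {g: false_positive_rate(r, thresholds[g])
             for g, r in records.items()}

print("shared threshold:", shared)
print("group thresholds:", per_group)
```

With the shared threshold, group B’s false positive rate is twice that of group A in this toy data; with the adjusted threshold the two rates match.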

“What people are finding is that it’s really tricky to decide what constitutes fairness in these types of contexts. Very often, allowing the classifier to be aware of the protected-group status actually leads to better overall outcomes,” Smith said.

The second technique the BU researchers introduced was to change the automated decision-maker to declare some cases as uncertain, especially when the risk scores are close to decision thresholds.

“In the technique we developed, we allow the post-processing algorithm to introduce uncertainty. Sometimes, it can say that it can’t decide on a particular case,” Ran Canetti, Professor of Computer Science at Boston University and co-author of the research paper, told me.

When the decision-maker system flags a case as uncertain, it will go through a different decision process, which can be manual or handled by a different automated system. For instance, in the case of recidivism, if the system concludes it can’t decide on a specific defendant, the case will be deferred to a panel of judges or go through a more thorough investigation.
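A deferring decision-maker of this kind might look something like the following Python sketch. The band limits of 0.4 and 0.6 and the function name `decide_or_defer` are assumptions chosen for illustration; the paper’s own post-processing algorithms are not reproduced here.

```python
from typing import Optional


def decide_or_defer(risk_score: float,
                    low: float = 0.4,
                    high: float = 0.6) -> Optional[bool]:
    """Return True (high-risk), False (low-risk), or None (can't decide).

    Scores inside the band [low, high] are treated as too close to call
    and are handed to a separate process, such as a panel of judges.
    """
    if risk_score > high:
        return True
    if risk_score < low:
        return False
    return None


# Hypothetical scores from the first-stage classifier.
scores = [0.10, 0.45, 0.55, 0.90]
outcomes = [decide_or_defer(s) for s in scores]
deferred = [s for s, o in zip(scores, outcomes) if o is None]
decided_share = 1 - len(deferred) / len(scores)

print(outcomes)
print("deferred scores:", deferred)
print("share decided automatically:", decided_share)
```

Widening the band sends more cases to the second process, while narrowing it keeps more decisions automated, so the band width is a trade-off that has to be set for each application.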

“These are partial decision algorithms. They take a numerical score, sometimes they output a decision, and sometimes they output ‘Sorry. I can’t decide on that.’ But when they do output a decision, they can be fair,” Canetti said.

The BU researchers tested the approach on the data released from the COMPAS investigation. They found that only 75 percent of the cases got a correct automated decision; the rest needed to be deferred to a separate process.

The researchers explained that while this approach will help mitigate unfairness and bias in automated decisions, it’s still not a perfect solution.

For instance, if the two-stage process imposes heavy costs on the people it affects, such as many more months of investigation, time off work or something invasive, then it will introduce its own kind of bias and disadvantage toward that population.

That is one of the topics the BU team will investigate in the next stages of their work. “We try to lay out different prototypical settings where this might come up and talk about different factors you might want to take into account when using this type of deferral decision-making process,” Smith said.

While the team used recidivism as the test-bed for their project, the results of their work can be applied to other domains where algorithmic bias is a problem. One area that can benefit from such techniques is recruitment. In October, Amazon had to shut down its automated recruitment engine because it was biased against women. Other domains include loans and education, where people might be disadvantaged because of their ethnic and racial backgrounds. With tools that give more visibility into the fairness of AI algorithms, people who use these systems will have a better grasp of when they can or can’t trust them.

As far as the researchers are concerned, their work is not finished. “There are still many more things to be said, and we’re still trying to formulate what the right questions are. Our goal is to give people a language and a set of mathematical tools for reasoning about these complicated, multi-staged systems in a way that is not possible right now because we don’t have the conceptual framework, we don’t have the terminology, and we don’t have the math,” Smith said.

Ben Dickson is a software engineer and the founder of TechTalks, a blog that explores the ways technology is solving and creating problems.