A paper coauthored by 112 researchers across 160 data and social science teams found that AI and statistical models used to predict six life outcomes for children, parents, and households weren’t very accurate, even when trained on nearly 13,000 variables for each of more than 4,000 families. The authors argue that the work is a cautionary tale about the use of predictive modeling, especially in the criminal justice system and social support programs.

“Here’s a setting where we have hundreds of participants and a rich data set, and even the best AI results are still not accurate,” said study co-lead author Matt Salganik, a professor of sociology at Princeton and interim director of the Center for Information Technology Policy at the Woodrow Wilson School of Public and International Affairs. “These results show us that machine learning isn’t magic; there are clearly other factors at play when it comes to predicting the life course.”

Fragile Families Study

The study, published this week in the journal Proceedings of the National Academy of Sciences, is the fruit of the Fragile Families Challenge, a multi-year collaboration that recruited researchers to predict the same outcomes using the same data. Of the 457 groups that applied, 160 were selected to participate, and their predictions were scored with an error metric that measured how well they predicted held-out data (i.e., data kept by the organizers and not available to the participants).

The Challenge was an outgrowth of the Fragile Families Study (formerly Fragile Families and Child Wellbeing Study) based at Princeton, Columbia University, and the University of Michigan, which has been studying a cohort of about 5,000 children born in 20 large American cities between 1998 and 2000. It’s designed to oversample births to unmarried couples in those cities, and to address four questions of interest to researchers and policymakers:

  • The conditions and capabilities of unmarried parents
  • The nature of the relationships between unmarried parents
  • How the children born into these families fare
  • How policies and environmental conditions affect families and children

“When we began, I really didn’t know what a mass collaboration was, but I knew it would be a good idea to introduce our data to a new group of researchers: data scientists,” said Sara McLanahan, the William S. Tod Professor of Sociology and Public Affairs at Princeton. “The results were eye-opening.”

The Fragile Families Study data set consists of modules, each of which is made up of roughly 10 sections, where each section includes questions about a topic asked of the children’s parents, caregivers, teachers, and the children themselves. For example, a mother who recently gave birth might be asked about relationships with extended kin, government programs, and marriage attitudes, while a 9-year-old child might be asked about parental supervision, sibling relationships, and school. In addition to the surveys, the corpus contains the results of in-home assessments, including psychometric testing, biometric measurements, and observations of neighborhoods and homes.

The goal of the Challenge was to predict social outcomes for children at age 15, a survey wave that encompasses 1,617 variables. Of those variables, six were selected as the focus:

  • Grade point average
  • Grit
  • Household eviction
  • Material hardship
  • Primary caregiver layoff
  • Primary caregiver participation in job training

Contributing researchers were given anonymized background data on 4,242 families, with 12,942 variables about each family, as well as training data containing the six outcomes for half of the families. Once the Challenge was completed, all 160 submissions were scored using the holdout data.
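
The split between training and holdout families described above can be illustrated with a minimal sketch. It assumes a plain random 50/50 split; the function name `split_families`, the seed, and the use of integer family IDs are hypothetical and are not taken from the Challenge's own tooling.

```python
import random


def split_families(family_ids, train_fraction=0.5, seed=0):
    """Shuffle family IDs reproducibly, then split them into training and holdout lists.

    Assumption: a simple random split. The Challenge organizers' actual
    procedure is not described in this article.
    """
    ids = list(family_ids)
    random.Random(seed).shuffle(ids)  # seeded so the split can be reproduced
    cut = int(len(ids) * train_fraction)
    return ids[:cut], ids[cut:]


# 4,242 families, as reported in the article; the IDs themselves are made up.
train_ids, holdout_ids = split_families(range(4242))
```

Participants would see outcomes only for the training families, while the organizers keep the outcomes for the holdout families and use them solely for scoring.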

In the end, even the best of the more than 3,000 models submitted, which often used complex AI methods and had access to thousands of predictor variables, weren’t spot on. In fact, they were only marginally better than simple benchmark models built with linear regression and logistic regression, long-established statistical techniques that here used only a few predictors.
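
To make the idea of a simple benchmark concrete, here is a minimal sketch of a one-predictor linear regression fit on training data and scored on held-out data. The toy numbers, the single predictor, and the helper names `fit_line` and `mse` are all assumptions for illustration; the study's actual benchmark used its own small set of predictors.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one predictor: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def mse(y_true, y_pred):
    """Mean squared error between observed and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical toy data: one predictor, one outcome.
train_x, train_y = [0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0]
holdout_x, holdout_y = [4.0, 5.0], [9.5, 10.5]

slope, intercept = fit_line(train_x, train_y)
benchmark_preds = [intercept + slope * x for x in holdout_x]
benchmark_mse = mse(holdout_y, benchmark_preds)
```

A more complex model earns its extra cost only if its held-out error is meaningfully lower than `benchmark_mse`.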

“Either luck plays a major role in people’s lives, or our theories as social scientists are missing some important variable,” added McLanahan. “It’s too early at this point to know for sure.”

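Performance was measured by the coefficient of determination (R²), which compares a model’s squared prediction error on the held-out data with that of a naive baseline: 1 means perfect prediction and 0 means no better than the baseline. For “material hardship” (i.e., whether the families of 15-year-old children suffered financial difficulties), the best model scored 0.23, meaning it explained roughly 23% of the variation in that outcome. The best GPA predictions scored 0.19 (19%), while grit, eviction, job training, and layoffs scored 0.06 (6%), 0.05 (5%), 0.05 (5%), and 0.03 (3%), respectively.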
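
For readers who want to see how such a score is computed, here is a minimal sketch of a holdout R². It assumes the naive baseline is the mean outcome in the training data, which is one common convention; the function name `r2_holdout` and the toy values are hypothetical.

```python
def r2_holdout(y_true, y_pred, train_mean):
    """Holdout R²: 1 minus the model's squared error over the baseline's squared error.

    Assumption: the baseline predicts `train_mean` (the mean outcome in the
    training data) for every held-out case.
    """
    sse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # model error
    sst = sum((t - train_mean) ** 2 for t in y_true)         # baseline error
    return 1.0 - sse / sst


# Hypothetical held-out outcomes and a training mean of 2.5.
y_true = [1.0, 2.0, 3.0, 4.0]

perfect = r2_holdout(y_true, y_true, 2.5)                  # every prediction exact
baseline = r2_holdout(y_true, [2.5] * 4, 2.5)              # always guess the mean
partial = r2_holdout(y_true, [2.0, 2.0, 3.0, 3.0], 2.5)    # somewhat informative
reversed_order = r2_holdout(y_true, [4.0, 3.0, 2.0, 1.0], 2.5)  # worse than the mean
```

Under this definition a score can fall below zero when predictions are worse than always guessing the training mean, so a value such as 0.05 means the model is only slightly ahead of that naive guess.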

“The results raise questions about the relative performance of complex machine-learning models compared with simple benchmark models. In the … Challenge, the simple benchmark model with only a few predictors was only slightly worse than the most accurate submission, and it actually outperformed many of the submissions,” concluded the study’s coauthors. “Therefore, before using complex predictive models, we recommend that policymakers determine whether the achievable level of predictive accuracy is appropriate for the setting where the predictions will be used, whether complex models are more accurate than simple models or domain experts in their setting, and whether possible improvement in predictive performance is worth the additional costs to create, test, and understand the more complex model.”

The research team is currently applying for grants to continue studies in this area, and they’ve also published 12 of the teams’ results in a special issue of Socius, a new open-access journal from the American Sociological Association. To support additional research, all the submissions to the Challenge, including the code, predictions, and narrative explanations, will be made publicly available.

Algorithmic bias

The Challenge isn’t the first to expose the predictive shortcomings of AI and machine learning models. The Partnership on AI, a nonprofit coalition committed to the responsible use of AI, concluded in its first-ever report last year that algorithms are unfit to automate the pre-trial bail process or to label some people as high-risk and detain them. Algorithms used to inform judges’ decisions have been known to produce racially biased results, more often labeling African-American inmates as at risk of recidivism.

It’s well understood that AI has a bias problem. For instance, word embedding, a common technique that represents words as numeric vectors, unavoidably picks up (and at worst amplifies) prejudices implicit in source text and dialogue. A recent study by the National Institute of Standards and Technology (NIST) found that many facial recognition systems misidentify the faces of people of color more often than Caucasian faces. And Amazon’s internal recruitment tool, which was trained on resumes submitted over a 10-year period, was reportedly scrapped because it showed bias against women.
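
As a rough illustration of how bias can surface in word vectors, the sketch below compares how close an occupation word sits to two gendered words using cosine similarity. The three-dimensional vectors are invented for this example and do not come from any real embedding model; `association_gap` is a hypothetical helper, not a standard bias metric.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def association_gap(word, a, b, vectors):
    """How much closer `word` is to `a` than to `b` (positive means closer to `a`)."""
    return cosine(vectors[word], vectors[a]) - cosine(vectors[word], vectors[b])


# Made-up toy vectors; real embeddings have hundreds of dimensions.
vectors = {
    "he": [1.0, 0.0, 0.2],
    "she": [0.0, 1.0, 0.2],
    "engineer": [0.8, 0.2, 0.5],
}

gap = association_gap("engineer", "he", "she", vectors)
```

With these invented numbers the gap is positive, the kind of skew that a model trained on biased text could carry into downstream decisions.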

A number of solutions have been proposed, from algorithmic tools to services that detect bias by crowdsourcing large training data sets.

In June 2019, working with experts in AI fairness, Microsoft revised and expanded the data sets it uses to train Face API, a Microsoft Azure API that provides algorithms for detecting, recognizing, and analyzing human faces in images. Last May, Facebook announced Fairness Flow, which automatically sends a warning if an algorithm is making an unfair judgment about a person based on their race, gender, or age. Google recently released the What-If Tool, a bias-detecting feature of the TensorBoard web dashboard for its TensorFlow machine learning framework. Not to be outdone, IBM last fall released AI Fairness 360, a cloud-based, fully automated suite that “continually provides [insights]” into how AI systems are making their decisions and recommends adjustments — such as algorithmic tweaks or counterbalancing data — that might lessen the impact of prejudice.