Stanford and Carnegie Mellon find race and age bias in mobility data that drives COVID-19 policy

Smartphone-based mobility data has played a major role in responses to the pandemic. Describing the movement of millions of people, location information from Google, Apple, and others has been used to analyze the effectiveness of social distancing policies and probe how different sectors of the economy have been affected. But a new study from researchers at Stanford and Carnegie Mellon finds that particular groups of people, including older and nonwhite U.S. voters, are less likely to be captured by mobility data than demographic majorities. The coauthors argue that these groups could be disproportionately harmed if biased mobility data is used to allocate public health resources.

Analytics providers like Factual, Radar, and PlaceIQ obtain data from opt-in location-sharing apps but rarely disclose which apps feed into their datasets, preventing policymakers and researchers from understanding who's represented. (Prior work has shown sociodemographic and age biases in smartphone ownership, with children and the elderly frequently underrepresented in mobile phone data.) Black, Native American, and Latinx communities have seen high case and death counts from COVID-19, and the pandemic has reinforced existing health inequities. If certain races or age groups are not well-represented in the data used to inform policy-making, there's a risk of enacting policies that fail to help those at greatest risk, the Stanford and Carnegie Mellon researchers assert.

The team examined a mobility dataset maintained by startup SafeGraph, which contains smartphone location data from navigation, weather, and social media apps aggregated by points of interest (e.g., schools, restaurants, parks, airports, and brick-and-mortar stores). SafeGraph released much of its data for free as part of the COVID-19 Data Consortium when the pandemic hit, and as a result, the company's data has become the "dataset de rigueur" in pandemic research. For example, the U.S. Centers for Disease Control and Prevention employs it to identify health systems nearing capacity and to guide the agency's public health communications strategy. The California Governor's Office relies on SafeGraph data to develop COVID-19 policies, including risk measurements of specific areas and facilities and enforcement of physical distancing measures, as do the cities of Los Angeles, San Francisco, San Jose, San Antonio, Memphis, and Louisville.

SafeGraph published a report about the representativeness of its data, but the coauthors of this new study take issue with the company's methodology. In the interest of thoroughness, they created their own framework to assess how well SafeGraph measures ground truth mobility and whether its coverage varies with demographics.

The coauthors looked at 2018 voter turnout data in records from U.S. authorities, aiming to see whether SafeGraph coverage of voters at poll locations varied with those voters' demographics, which would give an indication as to whether demographic bias in the dataset existed. They used records provided by private voter file vendor L2 and poll precinct information from the North Carolina Secretary of State, netting a dataset of 595,000 voters who turned out at 549 different voting locations.

The results of the researchers' audit show that SafeGraph's data captured the mobility of voters over the age of 65 and of nonwhite voters less well than that of younger, white voters. "The large coefficient on age indicates that each percentage point increase in voters over 65 is associated with a 4 point drop in rank relative to the optimal ranking," the coauthors explained. "Similarly, the coefficient on race indicates that each point increase in percent nonwhite is associated with a one point drop in rank relative to the optimal ranking. This demonstrates that ranking by SafeGraph traffic may disproportionately harm older and minority populations by, for instance, failing to locate pop-up testing sites where needed the most."
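The rank comparison the coauthors describe can be illustrated with a small sketch. The Python below is not the study's code: the six polling locations and all of their numbers are invented for illustration, and it fits a one-variable least-squares slope rather than the regression on both age and race that the quote describes. It ranks locations by true turnout and by observed device traffic, then checks how the rank drop moves with the share of voters over 65.

```python
# Hypothetical polling locations: (true voters, percent over 65, devices observed).
# All numbers are invented for illustration; none come from the study.
locations = {
    "A": (1000, 10, 50),
    "B": (900, 40, 20),
    "C": (800, 15, 38),
    "D": (700, 50, 12),
    "E": (600, 20, 27),
    "F": (500, 30, 18),
}


def rank_desc(values):
    """Return ranks where 1 is the largest value (assumes no ties)."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    ranks = [0] * len(values)
    for position, i in enumerate(order, start=1):
        ranks[i] = position
    return ranks


def ols_slope(xs, ys):
    """Slope of the least-squares line of ys on xs (single predictor)."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var


voters = [v for v, _, _ in locations.values()]
pct_over_65 = [p for _, p, _ in locations.values()]
devices = [d for _, _, d in locations.values()]

true_rank = rank_desc(voters)       # the "optimal" ranking by actual turnout
observed_rank = rank_desc(devices)  # the ranking a device-traffic user would see

# Positive rank_drop: the location is ranked lower by device traffic than it should be.
rank_drop = [obs - true for obs, true in zip(observed_rank, true_rank)]

slope = ols_slope(pct_over_65, rank_drop)
print(f"rank drop per percentage point of voters over 65: {slope:.2f}")
```

In this toy data the slope is positive, meaning locations with older voters tend to fall in the device-based ranking. The size of the slope here says nothing about the magnitudes reported in the study.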

The apparent bias in SafeGraph's mobility data could lead governments to underprovide health care resources like masks and to make poorly informed decisions about opening or closing categories of businesses in public health orders. In one thought experiment conducted in the course of the study, the researchers found that strict reliance on SafeGraph would underallocate resources by 35% to areas with older and nonwhite populations and overallocate resources by 30% to the youngest and whitest groups.
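The mechanism behind such a misallocation can be sketched in a few lines. The example below is a simplified illustration, not the researchers' thought experiment: the two areas, their visit counts, and their coverage rates are hypothetical, and resources are assumed to be split in proportion to traffic. When one group is captured at half the rate of the other, a traffic-proportional split shortchanges it even though true demand is equal.

```python
def allocate(counts, budget):
    """Split a budget across areas in proportion to their counts."""
    total = sum(counts.values())
    return {area: budget * n / total for area, n in counts.items()}


# Hypothetical inputs: equal true traffic, unequal device coverage.
true_visits = {"older_nonwhite_area": 1000, "younger_white_area": 1000}
coverage = {"older_nonwhite_area": 0.02, "younger_white_area": 0.04}

# What a device-based dataset would record under these coverage rates.
observed_visits = {area: true_visits[area] * coverage[area] for area in true_visits}

budget = 100.0  # e.g. 100 units of some resource (assumed)
ideal = allocate(true_visits, budget)      # split based on actual traffic
naive = allocate(observed_visits, budget)  # split based on observed traffic

# Relative error of the naive split against the ideal one, per area.
relative_error = {area: naive[area] / ideal[area] - 1 for area in ideal}

for area, err in relative_error.items():
    print(f"{area}: {err:+.0%} relative to the ideal allocation")
```

The total budget is unchanged in both splits. Only its distribution shifts, which is why an undercount for one group necessarily shows up as an overallocation to another.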

The coauthors suggest a fix in the form of bias correction weights for age and race. They also call for increased transparency on the part of SafeGraph and other data providers, which they say might allow policymakers to use what's known about the sources of location information and make adjustments accordingly. "We find that coverage is notably skewed along race and age demographics, both of which are significant risk factors for COVID-19 related mortality," the coauthors wrote. "Without paying attention to such blind spots, we risk exacerbating serious existing inequities in the health care response to the pandemic."
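The article does not describe how the proposed correction weights are computed. One common approach, assumed here purely for illustration, is inverse-coverage weighting: estimate each group's coverage rate where ground truth is available (as with polling-place turnout), then divide later device counts by that rate. The group labels and counts below are hypothetical, and the sketch assumes device counts can be attributed to a group, for example through the demographics of the surrounding area.

```python
def coverage_rates(devices_seen, people_present):
    """Fraction of people in each group who appear as devices in the data."""
    return {g: devices_seen[g] / people_present[g] for g in devices_seen}


def apply_weights(observed_devices, rates):
    """Scale device counts by 1 / coverage to estimate the people present."""
    return {g: observed_devices[g] / rates[g] for g in observed_devices}


# Calibration step: hypothetical counts at locations with known turnout.
calibration_devices = {"65_and_over": 30, "under_65": 120}
calibration_voters = {"65_and_over": 2000, "under_65": 4000}
rates = coverage_rates(calibration_devices, calibration_voters)

# Later observation with no ground truth: raw device counts suggest a 1:4 ratio.
new_observation = {"65_and_over": 15, "under_65": 60}
estimated_people = apply_weights(new_observation, rates)

for group, estimate in estimated_people.items():
    print(f"{group}: {new_observation[group]} devices, about {estimate:.0f} people")
```

After weighting, the estimated ratio between the two groups is 1:2 rather than the 1:4 implied by raw device counts. A correction like this is only as good as the calibration data, which is one reason transparency about data sources matters.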
