The very real risks of excluding the disengaged from your dataset

Analysts use machine learning tools to analyze the vast amounts of data now available. Big data is appealing, but even the best tools and techniques will give you noise if the underlying data aren’t robust and inclusive. To reliably predict elections, anticipate consumer demand, or forecast the trajectory of a pandemic, leaders need to ask themselves who is (and isn’t) included in their dataset.

For example, we are approaching the four-year anniversary of Britain’s shocking vote to leave the European Union. People around the globe were stunned by this decision, largely because polls told us there would be no Brexit. Most young Britons supported remaining in the EU, but many of them didn’t vote that day. The behavior of these quiet, disengaged young people made the crucial difference to the outcome. Why did the conventional surveys miss this? What did pollsters ask the young people who did not plan to vote? Not very much, because in reality pollsters did not connect with many of them. As a result, the Brexit polls overrepresented the views of voters and underrepresented the disengaged.

This is one notable example of the real risks of relying on data from the usual, more engaged voices rather than the broadest, most diverse set of voices, including those who don’t typically fill out polls or are not active on social media. The same thing has happened in numerous referendums and elections since, including the surprise election of Donald Trump as U.S. president in 2016.

The risks of excluding the disengaged apply far beyond predicting referendums or elections. They also affect a wide range of critical business, economic, and public policy issues. For instance, during the COVID-19 pandemic, we learned that when we don’t hear from all sources, the gaps in our data can leave us blindsided by new outbreaks.

Social media analysis is appealing because it provides a large, continuing dataset. But big data can heighten the risk of drawing a wrong conclusion when it reflects only a narrow group of voices. In fact, most people are not active on social media. Traditional tools of business intelligence such as focus groups and panel surveys may also amplify our biases when they exclude the diverse set of voices that we need for reliable business intelligence.
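To make that risk concrete, here is a minimal sketch in Python. Every number in it is hypothetical and chosen only for illustration: it assumes a population that is 40% engaged and 60% disengaged, with different levels of support for some option in each group. The function name `expected_estimate` and the `reach` probabilities are our own illustrative choices, not a description of any real poll or platform.

```python
def expected_estimate(groups, reach):
    """Expected support among the people a data source actually reaches.

    groups: {name: (population_share, support_rate)}
    reach:  {name: chance that a member of that group shows up in the data}
    """
    # Weight each group by how much of it ends up in the data.
    weights = {name: share * reach[name] for name, (share, _) in groups.items()}
    total = sum(weights.values())
    return sum(weights[name] * groups[name][1] for name in groups) / total


# Hypothetical population, for illustration only.
groups = {
    "engaged": (0.40, 0.60),      # 40% of people, 60% of them in favor
    "disengaged": (0.60, 0.40),   # 60% of people, 40% of them in favor
}

# A source that reaches everyone equally vs. one that mostly hears the engaged.
true_support = expected_estimate(groups, {"engaged": 1.0, "disengaged": 1.0})
skewed_support = expected_estimate(groups, {"engaged": 0.9, "disengaged": 0.1})

print(f"whole population:     {true_support:.1%}")
print(f"engaged-heavy source: {skewed_support:.1%}")
```

Under these assumed numbers, the engaged-heavy source points to a majority in favor while the population as a whole leans the other way. Note that sample size appears nowhere in the calculation: gathering more data from the same skewed source leaves the expected answer unchanged and only makes the wrong answer look more precise.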

To reliably predict consumer demand, forecast the trajectory of a pandemic, or avoid overreacting to a corporate crisis, leaders need to ask themselves who they are not hearing from and find ways to include those people in their data. The same principle applies to understanding economic trends. Young people and new immigrant groups tend not to participate in the surveys that underlie employment data. Without more inclusive data that capture these groups, leaders might fail to deliver the appropriate dose of economic relief at the appropriate time to the very populations that were left out.

To address this problem, we’re using a technology that randomly engages anyone using the web. The method is analogous to random-digit dialing, applied to the web-using population, with the aim of reaching far beyond typical survey respondents to a much broader cross-section of the population in each country of the world.

What have we learned from these more disengaged voices? That Donald Trump would be elected in 2016, and that there was a real risk of Brexit well in advance of the Brexit “surprise.” We’ve heard from Americans who don’t typically answer government surveys about their concerns around vaccination, and learned what would persuade them to get vaccinated. We’re also using this approach to get more reliable, independent sentiment data in Russia amid the current crisis in Ukraine. There we have learned that fewer Russians support Putin’s approach to the invasion than is commonly reported in typical opinion polls, and also that fewer people oppose him than the anti-war protests might suggest.

As all of us evaluate data for decision-making, we can’t get caught up in applying the latest machine learning tools to big data before we ask: Who is left out of our dataset? What can we learn from those people? If we don’t solve that problem, we risk getting our predictions and decisions wrong.

Greg Wong is CEO and Danielle Goldfarb is head of global research at RIWI.
