This is a guest post by data scientist David Smith
While I’m pleased with the outcome of yesterday’s election in political terms, I’m equally thrilled about the victory for data science versus gut instinct, philosophy and fundamentals.
This was nowhere better exemplified than on Fox News last night, where Republican strategist Karl Rove refused to believe the incoming returns and broke into an argument with Fox’s own election analysts when they called the election for Obama. Even before the voting began, he was sure the Republicans had it in the bag. He was wrong.
The battle between statistician Nate Silver and the pundits, particularly those on the right, was especially stark.
Pundits in the media have a built-in motivation to portray the race as a close one. Nothing drives pageviews like a horserace. But their refusal to admit the high probability of an Obama win was a classic case of tradition versus data science. They viewed data scientists as nothing more than glorified spreadsheet jockeys.
The pundit class, once valued for their instinct, insider knowledge and “expert” analysis, was clearly threatened by this successful use of data and facts. [Editor’s note: Indeed, VentureBeat will be discussing how leveraging “big data” is one of the biggest trends impacting the wider economy at its CloudBeat 2012 conference later this month.]
I’ve been following Silver’s FiveThirtyEight blog since before the 2008 election (well before it was acquired by the New York Times), and I’ve always admired his methods from a statistical point of view. His forecasts are much more than simple poll-averaging, and the reasons for his success are manifold:
- Use of many sources of data. While many pundits used national polls (and often cherry-picked ones, at that), Silver also incorporated hundreds of state-level polls into his analysis. And poll data wasn’t the only source either: other factors that can influence elections, like economic variables, demographics, and party registration figures, were also incorporated.
- Using the past to guide the future. Like any good statistician, Silver didn’t attempt to forecast the election by extrapolating from a limited range of data points. He also incorporated historical data (electoral outcomes, historical polls, and economic data) so that outcomes typical of past elections carried some weight in the current forecast.
- Extracting information from every source. While some analysts might cherry-pick data sources according to whether they were qualitatively “reliable” or “unbiased”, Silver incorporated them all. And with good reason: there’s still information in the presence of bias. For example, Rasmussen is a reputable polling firm, but well known to have a “house effect” that favors Republican candidates. Rather than reject this data, Silver’s model instead looks at trends over time: if Rasmussen’s polls move from 55 percent Romney to 52 percent Romney in a week, that’s still information in Obama’s favor.
- Understanding correlations. Unlike most amateur pundits, Silver understands that political data is connected. If Texas moves right on the political spectrum, it’s likely that Oklahoma will too: they’re similar states that have moved in similar directions in the past. That’s one reason for including as much data as possible in the model: many such correlations will be captured by modeling underlying trends like party registration or minority residency. But there are still inherent correlations between states, districts and polls that Silver was able to estimate from past data and include in his model.
- Use of statistical models. To assemble all of this information requires much more than just spreadsheet-jockeying: this is more than Moneyball, folks. This is one of the areas where Silver impresses: he used the right regression models with appropriate distributional assumptions to convert all that historical and contemporary data into race-by-race forecasts. You can’t do this stuff in Excel; you need sophisticated statistical modeling software.
- Monte Carlo simulations for the Electoral College. By far the area where Silver impressed the most was in his methodology for forecasting the presidential race. The national polls have insufficient information for this race, which is decided on a state-by-state level. His secret was to combine the results of the state-level analyses to estimate the allocation of the 538 electoral votes. (This is where the name of his blog comes from, by the way.) To incorporate all this information, and include the correlations between the state outcomes, is very difficult to do with equations. So he used an elegant solution: simulation. Every day, he ran thousands of mock elections, flipping a virtual coin (weighted according to the state-level models) for each of the 50 states, and counted the electoral votes. It was this process that led him to forecast a high probability of an Obama win: even though the national polls were close, the Democratic advantage in swing states with many electoral votes, like Ohio and Pennsylvania, was an ultimately insurmountable edge.
- Understanding of the limitations of polls. Any analysis is only ever as good as the input data. Polls were a big part of the analysis, but they can also be unreliable: sample bias, cell phones vs land lines, language barriers. Silver understands this fact well, and included in the model a factor that captures this variability in poll bias. That’s why, even though the “fundamentals” of the state-by-state polls made an Obama victory seem almost certain, he forecast that probability at “only” 90 percent. The remaining 10 percent was the chance that the polls were wrong enough to swing the result the other way.
- A consistency in methodology. Elections are dynamic beasts, but Silver never succumbed to the temptation to tweak his model: the only thing that changed was the incoming data. That’s what made his “chance of winning” charts (as shown below), which evolved as the campaign progressed, so compelling: they reflected only changes in the facts on the ground, not changes in Silver’s method of analyzing them. Contrast this with pollsters who routinely changed their sampling and weighting methodologies, making it less useful to compare changes in polls on a long-term basis.
- A focus on probabilities, not predictions. If you read his blog posts carefully (and not just others reporting on them), Silver was always careful to point out that the end result was ultimately unknowable until the final polls came in. While he was forecasting a 90 percent chance of an Obama win, there was still the possibility that Romney could beat the odds, or that the polls were off far enough in his direction for him to pull out a win. Even if Romney had won, it wouldn’t have invalidated Silver’s methodology. We can only run each election once, but if we could rewind the clock and do it over many times, I’m confident that Obama would have won many more times than Romney based on Silver’s model.
- Great communication skills. Finally, and this is something I really admire, Silver has a great skill in being able to communicate complex statistical topics to a lay audience. Probabilities, especially, are something that most people lack an intuitive understanding of. (See: casinos, success of.) The way he described Romney’s chances the evening before the election was especially elegant, and was put in terms most people could relate to:
All of this leaves Mr. Romney drawing to an inside straight. I hope you’ll excuse the cliché, but it’s appropriate here: in poker, making an inside straight requires you to catch one of 4 cards out of 48 remaining in the deck, the chances of which are about 8 percent. Those are now about Mr. Romney’s chances of winning the Electoral College, according to the FiveThirtyEight forecast.
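For readers who want to see the simulation idea from the list above in concrete form, here is a minimal sketch in Python. It is not Silver’s model, and every number in it is made up for illustration: the five hypothetical states, their projected margins, the count of “safe” electoral votes and the two error sizes are all assumptions. What it does show is the mechanism: each mock election draws one shared polling error (which makes the states move together and allows for the polls being systematically off) plus an independent error per state, then counts electoral votes.

```python
import random

# Hypothetical inputs, NOT real forecasts:
# state -> (electoral votes, projected Democratic margin in points).
STATES = {
    "State A": (18, 3.0),
    "State B": (20, 5.0),
    "State C": (29, 0.0),
    "State D": (13, 2.0),
    "State E": (15, -2.0),
}
SAFE_DEM_VOTES = 237  # assumed votes from states treated as certain
VOTES_TO_WIN = 270    # majority of the 538 electoral votes


def simulate_once(rng, national_sd=2.0, state_sd=3.0):
    """Run one mock election; return the Democratic electoral vote total."""
    # One shared error per election stands in for systematic poll bias.
    # Because every state receives it, state outcomes are correlated.
    national_error = rng.gauss(0.0, national_sd)
    total = SAFE_DEM_VOTES
    for votes, margin in STATES.values():
        state_error = rng.gauss(0.0, state_sd)  # state-specific noise
        if margin + national_error + state_error > 0:
            total += votes
    return total


def win_probability(n_sims=10_000, seed=538):
    """Estimate the share of mock elections the Democrat wins."""
    rng = random.Random(seed)  # seeded so repeated runs give the same answer
    wins = sum(simulate_once(rng) >= VOTES_TO_WIN for _ in range(n_sims))
    return wins / n_sims


if __name__ == "__main__":
    print(f"Estimated win probability: {win_probability():.1%}")
```

Raising `national_sd` in this sketch pulls the estimated probability toward 50 percent, which is the same effect described above: allowing for the polls being wrong as a group is what keeps a comfortable-looking lead from turning into a forecast of certainty.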
All of these points are an excellent example of data science in practice. Silver combined his statistical analysis skills, his statistical software and data-wrangling prowess, and a deep understanding of polling, economic influences, and above all the Electoral College to create a winning model. He deserves his success, and I hope it convinces those in the media that data is more powerful than punditry.
David Smith is an R evangelist, data scientist and blogger with a long history with the R and statistics communities. After graduating with a degree in Statistics from the University of Adelaide, South Australia, he spent four years researching statistical methodology at Lancaster University in the United Kingdom, where he also developed a number of packages for the S-PLUS statistical modeling environment. He leads marketing for Revolution Analytics, supports R communities worldwide and is responsible for the Revolutions blog.
Nate Silver screenshot via The Daily Show