Pollsters and other politically minded data jockeys likely spent much of Wednesday questioning their life choices, as their predictions made them seem more akin to numerologists than to data scientists. The failure of polling and modeling to correctly predict the outcome of the presidential election has caused some to question the overall use of data and algorithms to answer important societal questions, at least those involving human behavior.
The lessons from these failures, however, are not about the limits of data, as a recent New York Times article concluded, but instead stem from the age-old garbage-in, garbage-out cliché. In particular, these failures remind us of two points regarding data analysis. First, even assuming it is possible to put together a random, representative sample, or weight the sample appropriately, asking people their opinions is often unlikely to yield key information since they have no incentive to be truthful (or to respond at all) when answering surveys. Second, analyzing data—especially survey data that are flawed to begin with—using models based on previous events is likely to yield misleading answers if the underlying phenomena the model is intended to capture have changed. That is, if fundamental conditions have changed since the model was built, the model will not predict the future accurately.
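To make the first point concrete, here is a toy simulation (purely illustrative, with made-up response rates, not any pollster's actual methodology): if supporters of one candidate are even modestly less likely to respond, the survey understates that candidate's support no matter how large the random sample is.

```python
# Illustrative simulation (made-up numbers, not any pollster's methodology):
# the population splits roughly 50/48 between candidates B and A, but B's
# supporters respond to the survey at a lower hypothetical rate, so even a
# very large random sample understates B's support.
import numpy as np

rng = np.random.default_rng(0)
population = rng.choice(["A", "B", "other"], size=1_000_000, p=[0.48, 0.50, 0.02])

# Hypothetical differential response rates: B supporters answer less often.
response_rate = {"A": 0.60, "B": 0.45, "other": 0.50}
responds = rng.random(population.size) < np.vectorize(response_rate.get)(population)
sample = population[responds]

print("True share for B:    ", round(float(np.mean(population == "B")), 3))
print("Surveyed share for B:", round(float(np.mean(sample == "B")), 3))
```

Weighting can correct for observable characteristics, but not for a reluctance to respond that is itself correlated with the answer.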
It is precisely these failures that the new data revolution seeks to mitigate. Economists and others have long argued that what people actually do, or “revealed preference,” is a better indicator of true preferences than what people say they will do. Big data, one component of the data revolution, generally refers to precisely that—vast quantities of data on individuals’ characteristics and what they do, not on what they say they will do. In addition, machine learning (ML)—the key tool used to analyze big data—can update its underlying model in real time as it gathers new information, thereby potentially lessening the problem of forecasting based on outdated models of the world.
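As a rough sketch of what updating in real time means in practice (using scikit-learn's SGDClassifier on entirely synthetic behavioral features, not any campaign's actual pipeline), an online learner refreshes its parameters with each new batch of observations instead of being fit once on historical data:

```python
# Minimal sketch of online ("real-time") model updating with synthetic data:
# the classifier's parameters are refreshed as each new batch of observations
# arrives, rather than being fit once on a fixed historical dataset.
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(1)
model = SGDClassifier()        # linear model fit by stochastic gradient descent
classes = np.array([0, 1])     # e.g. 0 = prefers candidate A, 1 = prefers candidate B

for day in range(30):          # a new batch of hypothetical behavioral data each day
    X = rng.normal(size=(500, 5))                                       # made-up features
    y = (X[:, 0] + 0.05 * day + rng.normal(size=500) > 0).astype(int)   # slowly drifting signal
    model.partial_fit(X, y, classes=classes)   # update the model in place with the new batch

new_X = rng.normal(size=(1000, 5))
print("Predicted share preferring B:", model.predict(new_X).mean())
```

The point of the sketch is the loop: a model that is never refit after a shift in the underlying signal keeps predicting the world as it used to be.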
These observations put policy debates about big data and privacy, for example, into sharp relief. Privacy advocates worry about the potentially negative implications of people sharing so much information. But without data on actual behavior, and ways to analyze new information rapidly, we risk remaining ignorant of important issues and trends at precisely the time such information would be useful. Alternative prediction models built on data not collected via surveys may prove more accurate. According to a BBC report, several algorithms that combined ML methods with social media data predicted the election outcome more accurately, recognizing Trump’s likelihood of winning when the other models did not.
Of course, it can be difficult to evaluate predictive models, even in hindsight, because future events are probabilistic. How does one determine whether a model predicting a 70 percent chance of an event is wrong if that event did not happen? After all, it had a 30 percent chance of not happening. If the event were to play out thousands of times and the predicted outcome occurred about 70 percent of the time, we could conclude the model accurately reflected the real world. In some cases, such as consumer purchases, modelers can run that test. But a given election does not happen thousands of times, making it difficult to know whether the model was “wrong.” Even so, nearly every (public) model predicted heavy odds of a Clinton victory, lending credence to the hypothesis that the models were based on flawed underlying data or calibrated using previous elections, where outcomes were likely driven by factors different from those in 2016.
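If the event did repeat, the check would be straightforward. The sketch below (with a made-up 70 percent forecast and simulated outcomes) compares the stated probability with the empirical frequency and computes a Brier score, a standard penalty for probabilistic forecasts. With a single election we get exactly one draw from that distribution, which is why “wrong” is hard to pin down.

```python
# Sketch of how a probabilistic forecast could be judged *if* the event repeated:
# compare the stated probability (a made-up 0.70 here) with the empirical
# frequency across many simulated repetitions, and compute the Brier score.
import numpy as np

rng = np.random.default_rng(2)
p_forecast = 0.70                      # the model's stated probability of the event
outcomes = rng.random(10_000) < 0.70   # pretend the world really produces the event 70% of the time

empirical_rate = outcomes.mean()
brier = np.mean((p_forecast - outcomes) ** 2)   # 0 is perfect; 0.25 is what a constant 50/50 forecast scores
print(f"Empirical frequency: {empirical_rate:.3f}")
print(f"Brier score:         {brier:.3f}")
```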
To be clear, I am not arguing that surveys are irrelevant. It may not always be possible to gain information from observed behavior, and sometimes we are interested in what people think rather than what they do. And, of course, people who work on surveys have done extensive research to better hone their ability to elicit truthful information and to analyze it. Still, the election highlights surveys’ weaknesses in certain situations.
I am also not arguing that more data or big data will always yield more reliable results. For example, it is difficult to believe that Hillary Clinton’s campaign, with its famed tech know-how, did not incorporate data available from data brokers into its models, which nevertheless appear to have failed. On the other hand, a data analytics company hired by the Trump campaign explained that it used vast quantities of data to positive effect, at least from that campaign’s point of view. Someday, the full data story of the campaigns will be written and we will have more insight into what explains the failures and (rare) successes.
Big data and machine learning, however, while hardly constituting a perfectly accurate crystal ball, provide better tools for planning and decision-making in many situations. Datasets of millions or billions of observations, analyzed with algorithms that can update in real time, have the potential to be more accurate than survey-based data fed into static models.
As big data and ML continue to develop, we should become better at incorporating different types of data and building more accurate prediction models. On the other hand, if we’ve learned one thing from the election, it’s that “experts” are often wrong, so perhaps you should take the arguments here with more than a grain of salt. Now, if only we could predict how many grains you should take…
As is so often the case, The Simpsons has already covered this:
The Only Accurate Way to Learn Voters’ Preferences