One of the rules in political science is to not let data drive your theories. In other words, one should try to come up with a theory before looking at the data on which it will be tested. This minimizes the risk of mistaking correlation for causation, a common error in the social sciences.
To make sense of this rule, let us look at an example. Every year when winter approaches, we observe an increase in the number of flu cases accompanied by an increase in the sales of winter gloves. Being imaginative, one could come up with a “theory” that postulates that the use of winter gloves is driving up the number of flu cases.
Taking this further, we can presume a causal mechanism: sweaty hands are more likely to pick up germs. The conclusion would be that people who use gloves are at higher risk of catching an infectious disease such as the flu.
Sounds good, despite being completely wrong. We know for a fact that weather patterns explain much of the transmission of the virus that causes the flu as well as the seasonal sales of winter gear.
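The glove-and-flu story can be reproduced with a small simulation. What follows is a minimal sketch, not an analysis of real data: the coefficients, noise levels, and names (`simulate`, `pearson`, `residuals`) are all invented for illustration. Temperature drives both series, and gloves have no effect on flu at all, yet the two still correlate strongly until temperature is accounted for.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


def residuals(ys, xs):
    """What is left of ys after removing a straight-line fit on xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return [(y - my) - slope * (x - mx) for x, y in zip(xs, ys)]


def simulate(weeks=2000, seed=0):
    """Hypothetical weekly data: temperature is the only real cause."""
    rng = random.Random(seed)
    temperature = [rng.uniform(-10, 30) for _ in range(weeks)]
    # Both outcomes depend on temperature plus independent noise.
    # Glove sales never enter the flu equation.
    glove_sales = [500 - 12 * t + rng.gauss(0, 40) for t in temperature]
    flu_cases = [300 - 8 * t + rng.gauss(0, 40) for t in temperature]
    return temperature, glove_sales, flu_cases


temperature, glove_sales, flu_cases = simulate()

# Naive view: gloves and flu move together.
raw = pearson(glove_sales, flu_cases)

# Controlling for the confounder: correlate what temperature cannot explain.
partial = pearson(residuals(glove_sales, temperature),
                  residuals(flu_cases, temperature))

print(f"raw correlation:     {raw:.2f}")
print(f"partial correlation: {partial:.2f}")
```

Under these assumptions the raw correlation is strongly positive, while the correlation that remains after removing the effect of temperature is close to zero. The data alone cannot say which variable to control for; that choice comes from theory.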
What this example crudely shows is that there is a downside to allowing our observations of the world to guide our theory-building efforts. Ideally, our theories should be guided less by what we observe and more by our trained intuition of the mechanisms that might be linking our variables of interest.
Until Big Data Came to the Forefront
What is big data? Simply put, it refers to datasets large and complex enough to overwhelm conventional desktop software (such as Excel and SPSS).
For political scientists, the volume, variety, and velocity of big data mean that it might be concealing information about citizens’ political behaviour. The key word here is ‘concealing’: the trends in big data escape the human eye, but can be identified with machines.
Consequences of Big Data for Theory-Building
It is hard to exaggerate the consequences of big data analysis for theory-building in political science: it is moving the discipline from being a theory-driven enterprise with emphasis on explaining, to a data-driven one with emphasis on prediction. This is a fundamental change in the nature of the discipline. Theory-driven research postulates that correlation is not causation and that no conclusion should be drawn on the basis of covariation alone. Data-driven research postulates that correlation is sufficient, and that accurate predictive models can be built on top of it.
Now you can understand the excitement over big data among campaign strategists and politicians. Free of theoretical strings, they can analyze extensive databases to observe and forecast voters’ behaviour and craft the ideal message to mobilize them.
It is Exciting, Yes.
But There is Another Side of the Coin
Some political scientists are sounding the alarm that the use of big data in political campaigning is increasing polarization in societies. This is because strategists are targeting the most ardent supporters with messages and ads designed to rally these die-hards in the final weeks of the campaign. By doing this, however, candidates are ignoring their more moderate voters and dismissing their opponents, driving the electorate to the fringes.
We saw this in the final days of the US election, as both presidential candidates aggressively targeted their most ardent supporters in battleground states like Florida and Arizona, completely dismissing undecided voters in these and other states.
Even more worrisome is the fact that forecasting events can potentially alter social outcomes like elections. Knowing who is in the lead inevitably changes the incentives and strategies of candidates and voters. This essentially turns big data analysis into a kind of field research, with the aggravating circumstance that it could have spillover effects on the social outcome of interest.
It is said that predictive models are only as robust as the data on which they are built. This is true. But when it comes to social phenomena, the answers they provide ought to be understood through the lens of the institutions, ideas, and historical processes in which data is embedded. When modelling these outcomes, we need to be keenly aware of the potential spillover effects of our own research.
Data is Political
Everything about it. How it is produced, collected, distributed, used, and disposed of. These decisions do not take place in a vacuum but in a specific political system in which actors have competing interests. Data is not a natural resource, nor is its use inconsequential in politics. Now more than ever, the act of observation may change what is being observed. As we endeavor to become more predictive and use data in new and imaginative ways, let us not lose sight of the human element in everything. Data is not itself truth. Data is a material we use to gain insights and make sense of a world we believe is governed by some sort of causal ordering.