I kind of Predicted it
My last article posted in June of 2016 talked about how in 2014 we will end up with a whole slew of new, sleek political models that will be able to “supposedly” predict the rise of Donald Trump. Tonight, as we watch the seemingly inevitable victory of Trump unfold, it’s becoming absolutely clear that the predictive and statistical poll-based models used by the most experienced professionals in the field are missing key components, or at minimum using bad training data.
Something is wrong with polling in general, with the models that use the polling, and with the standard methodology in current use. In fact, up to 100 days prior to an election, synthetic markets seem to be better at predicting the winner (74% of the time) than your average poll.
As many of you know, the models that exist today were spawned in the last decade and a half as the newest iteration of political prediction schemes. Since going mainstream, these models have actually been quite good at providing predictions within some reasonable error margin.
It’s important to understand that the way modeling and polling are done today is different from the way they used to be done. You’ve probably heard of some of the better-known models in use right now, like Nate Silver’s FiveThirtyEight or Larry Sabato’s Crystal Ball Electoral College. 270towin has a good list up if you’re curious. It’s these kinds of models that seem to get the most attention, at least during this particular election cycle. The majority of these models are poll-driven micro-simulations, and it’s very important that you know what a poll-driven micro-simulation is.
What is a Poll-driven Micro-Simulation?
A micro-simulation is a simulation written as code or a computer program. You feed it some expected behaviors, maybe some data, a synthetic population (fictional but representing some reality), and then you run it. Then you run it a few hundred, thousand, or million times while measuring some output that you’re interested in. Once you’re done, you look at how many times event A occurred versus event B (like Clinton versus Trump winning or losing, for example). For political election models, you’re generally interested in the final result: the popular vote, the electoral college count if you’re in the United States, and perhaps how the model predicts demographic votes in each scenario.
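Here’s a rough sketch of what that loop looks like in code. It is not how FiveThirtyEight or anyone else actually builds their model; the three states, their electoral votes, the poll shares and the error size are all made-up numbers, and the function names are just mine for illustration. Each run adds random noise to every state’s poll number, hands out the electoral votes, and the final output is simply the fraction of runs candidate A wins.

```python
import random

# Hypothetical inputs: state -> (electoral votes, poll share for candidate A, poll error std dev)
STATES = {
    "State 1": (20, 0.52, 0.03),
    "State 2": (15, 0.49, 0.03),
    "State 3": (10, 0.50, 0.03),
}

def simulate_once(states, rng):
    """Run one simulated election and return candidate A's electoral votes."""
    votes_a = 0
    for electoral_votes, poll_share, error_sd in states.values():
        # The simulated result is just the poll number plus random noise.
        share = rng.gauss(poll_share, error_sd)
        if share > 0.5:
            votes_a += electoral_votes
    return votes_a

def win_probability(states, runs=10_000, seed=0):
    """Repeat the election many times and count how often A gets a majority."""
    rng = random.Random(seed)
    total = sum(ev for ev, _, _ in states.values())
    wins = sum(simulate_once(states, rng) > total / 2 for _ in range(runs))
    return wins / runs

print(f"Candidate A wins in {win_probability(STATES):.1%} of runs")
```

Notice that everything the simulation “knows” about voters comes from the poll shares in that table. Nothing in the loop can disagree with the polls.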
Under normal circumstances that model of prediction seemed to be working very well, but like any model it can have some flaws.
First, as with any model: garbage in, garbage out. If the poll data you use to run the simulation is wrong, or contains some behavioral elements that you don’t explicitly take into consideration, then the simulation simply outputs what you put in there, garbage and all. For a long time this election cycle, many were saying that poll data in swing states was flat-out wrong because many voters were lying to the pollsters. There isn’t a clear reason why that was happening, but there are many hypotheses. Let’s mention some of them.
Some were saying that Trump voters were lying because it’s unpopular to publicly say that you’re voting for Trump in certain areas of the country; some were blaming the method of polling itself; others were blaming the constant bombardment of your average swing-state voter with calls, emails and other forms of communication. Some of the more cynical views were that voters simply wanted to stick it to the pollsters, as a way of saying F*** you to all those people who, in the minds of a fraction of the voters, were telling them how to vote.
The point is that the polls had to be wrong, or at least produced untrustworthy results, for any of the reasons I just mentioned. And there is no reason to believe that the models using those polls were perfect either; otherwise they would’ve contained the right kind of mechanisms to adjust for new voter behavior. Only rigorous academic studies will flesh out the reasons. So let’s leave that alone for now.
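To make the garbage-in, garbage-out point concrete, here’s a toy example. It assumes a single race, a polled share of 52% for candidate A, and a hidden bias of 3 points (people telling pollsters one thing and doing another). Those numbers are invented for illustration, not measured from any real poll. The simulation is run once on the poll as reported and once on what the poll would have said without the bias.

```python
import random

def chance_of_winning(share_a, error_sd=0.02, runs=20_000, seed=0):
    """Fraction of simulated elections in which candidate A's share exceeds 50%."""
    rng = random.Random(seed)
    wins = sum(rng.gauss(share_a, error_sd) > 0.5 for _ in range(runs))
    return wins / runs

polled_share = 0.52   # hypothetical: what respondents told the pollsters
hidden_bias = 0.03    # hypothetical: how much of that support is not real

as_polled = chance_of_winning(polled_share)
corrected = chance_of_winning(polled_share - hidden_bias)

print(f"Using the poll as reported: A wins {as_polled:.0%} of runs")
print(f"Using the corrected share:  A wins {corrected:.0%} of runs")
```

The simulation code is identical in both cases. The only thing that changed is the number fed into it, and that alone is enough to turn a comfortable favorite into an underdog.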
One reason I’d like to highlight is that the simulation-polling hybrid framework did not take into account a number of other technical factors. When you build a micro-simulation you essentially run a type of Monte Carlo experiment. The simple way of explaining a Monte Carlo sim is that it’s an experiment where you use the probabilities and statistics you receive from polls as the deciding probability for how the vote in the simulation goes. The problem is that you don’t create a sufficiently diverse population that shifts and changes with the news, or one where each individual voter makes his or her own decisions within the simulation itself (as in an agent-based model), independently of poll data. You just rely on having accurate poll data, end of story.
A micro-simulation is like putting all the voters into one pot, assuming that the poll data will sufficiently describe each one of them in the aggregate. This means that the micro-simulation fails to consider that most voters had (potentially) already made up their minds, or that others were likely affected by their social networks. Not everyone, of course, but perhaps just enough to make the 2%-4% difference in error. That missing heterogeneity could be why the micro-simulation does not take changing or pre-existing opinions within specific demographics into consideration. To be more specific, the micro-simulation adjusts the simulation’s probabilities across the board, so if poll numbers for one candidate go up, it is assumed that the increase came from an existing candidate’s loss. In other words, a micro-simulation has to assume that every time it’s run the turnout will be (probabilistically) the same, and it has no mechanism to explain an increase in turnout in any deterministic way.
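A tiny bit of arithmetic shows why that fixed-turnout assumption matters. The sketch below uses invented vote counts and two hypothetical helper functions. The first treats a candidate’s gain the way described above, as votes taken from the opponent with turnout unchanged. The second treats the same gain as people who weren’t expected to vote at all.

```python
def zero_sum_update(votes_a, votes_b, gain_a):
    """Fixed turnout: whatever A gains is taken away from B."""
    turnout = votes_a + votes_b
    new_a = votes_a + gain_a
    new_b = turnout - new_a
    return new_a, new_b

def new_voter_update(votes_a, votes_b, gain_a):
    """Growing turnout: A's gain comes from people who were not expected to vote."""
    return votes_a + gain_a, votes_b

# Hypothetical starting point: A trails 480 to 520, then picks up 20 supporters.
print(zero_sum_update(480, 520, 20))   # (500, 500): the race is tied
print(new_voter_update(480, 520, 20))  # (500, 520): B still leads, turnout is 1,020
```

Same 20 supporters, two different elections. A model that can only do the first kind of update will misread any movement that is really the second kind.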
A better way
We have to stop modeling only the registered or voting population through micro-simulations. It’s clearly becoming less and less effective over time. The population is adjusting to the modeling, and more and more erratic behavior in poll responses can be expected, which means that we now have to start modeling each individual voter through agent-based methods. In fact, we even have to model the entire population to get a better gauge of turnout, building simulations of every resident, citizen or not, registered or not, voter or not. We’ll also have to start taking social networks into account, and that may come in the form of social media networks as well as kin-based (family) networks. I’ll talk about that later; I’m testing a prototype model that will do just that.
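To give a feel for the difference, here is a bare-bones agent-based sketch. To be clear, this is not the prototype mentioned above, just a minimal illustration under my own assumptions: every resident is an agent with an opinion between 0 (firmly B) and 1 (firmly A), a personal chance of turning out, a "stubbornness" that says how made-up their mind is, and a handful of randomly assigned contacts standing in for social and kin networks. All of the names and numbers are hypothetical.

```python
import random
from dataclasses import dataclass, field

@dataclass(eq=False)
class Agent:
    opinion: float        # 0.0 = firmly B, 1.0 = firmly A (hypothetical scale)
    turnout_prob: float   # chance this resident actually votes
    stubbornness: float   # 1.0 = mind made up, 0.0 = fully swayed by contacts
    neighbors: list = field(default_factory=list)

def build_population(n, k, rng):
    """Create n agents, each with k random contacts (a stand-in for real networks)."""
    agents = [Agent(rng.random(), rng.uniform(0.3, 0.9), rng.random()) for _ in range(n)]
    for agent in agents:
        others = [a for a in agents if a is not agent]
        agent.neighbors = rng.sample(others, k)
    return agents

def step(agents):
    """One round of influence: each agent drifts toward its contacts' average opinion."""
    updated = []
    for agent in agents:
        peer_mean = sum(n.opinion for n in agent.neighbors) / len(agent.neighbors)
        updated.append(agent.stubbornness * agent.opinion
                       + (1 - agent.stubbornness) * peer_mean)
    for agent, opinion in zip(agents, updated):
        agent.opinion = opinion

def run_election(agents, rng):
    """Each agent decides individually whether to vote, and for whom."""
    votes_a = votes_b = 0
    for agent in agents:
        if rng.random() < agent.turnout_prob:
            if agent.opinion > 0.5:
                votes_a += 1
            else:
                votes_b += 1
    return votes_a, votes_b

rng = random.Random(42)
population = build_population(n=500, k=5, rng=rng)
for _ in range(10):          # ten rounds of social influence before election day
    step(population)
votes_a, votes_b = run_election(population, rng)
print(f"A: {votes_a}, B: {votes_b}, turnout: {votes_a + votes_b} of {len(population)}")
```

The point of the design is that turnout and vote choice both come out of individual agents rather than being read off a poll, so the model at least has a place where made-up minds, peer influence and new voters can live.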