Following up on our poll-driven election modeling series, and on my previous prediction that the poll-model system now in place will fail in this election cycle, it behooves us to at least consider what kind of modeling system we would replace it with. I should tell you upfront that I don’t have a final answer yet, but I am playing with some code. This is just a thinking-out-loud kind of situation.
I did mention in my last piece that an agent-based model (ABM) would be the most likely contender to replace a micro-simulation. The advantage is that you can instantiate an entire population of voters and give each agent in that population its own expected behaviors based on demographics, prior voting behavior (part of which is public record), geography (for example urban versus rural) and social network. The model can still be poll-driven, but an ABM can make use of much more of the data gathered in polls (or that could be gathered in polls) to set the baseline probability that a voter votes for Candidate A or Candidate B. In theory, an agent-based model is perfect for that type of application, but we run into some technical challenges right off the bat. Let me outline some of them.
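To make the idea of "one agent per voter" concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the `Voter` fields, the support numbers in `POLL_SUPPORT_A`, the geography shift and the turnout probabilities are made up, not taken from any real poll.

```python
import random
from dataclasses import dataclass


@dataclass
class Voter:
    """One simulated voter. Fields are illustrative, not a fixed schema."""
    age_group: str          # e.g. "18-29", "30-64", "65+"
    setting: str            # "urban" or "rural"
    voted_last_time: bool   # turnout history, part of which is public record


# Hypothetical poll-derived support for Candidate A, by age group.
POLL_SUPPORT_A = {"18-29": 0.58, "30-64": 0.49, "65+": 0.44}
# Hypothetical adjustment for geography.
SETTING_SHIFT = {"urban": 0.06, "rural": -0.06}


def prob_votes_a(v: Voter) -> float:
    """Baseline probability that this voter picks Candidate A, clamped to [0, 1]."""
    p = POLL_SUPPORT_A[v.age_group] + SETTING_SHIFT[v.setting]
    return min(max(p, 0.0), 1.0)


def turnout_prob(v: Voter) -> float:
    """Hypothetical turnout probability based only on prior voting behavior."""
    return 0.8 if v.voted_last_time else 0.35


def cast_vote(v: Voter, rng: random.Random):
    """Return "A", "B", or None if the agent stays home."""
    if rng.random() >= turnout_prob(v):
        return None
    return "A" if rng.random() < prob_votes_a(v) else "B"
```

A real model would replace the two lookup tables with probabilities estimated from poll cross-tabs, but the shape stays the same: each agent carries its own traits, and the traits set its odds.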
What are the challenges of a poll-driven agent-based election model?
Remember, you’ll have to create one agent for every single voter as an object in the simulation (that’s what an ABM is). Let’s say that you only instantiate potential voters to start (not the entire population, and not every registered voter either). Well, for this election cycle that’s about 120 million voters. But if you wanted to include social networks as a voting decision variable, you’d also have to include the people in those networks, which would likely mean the entire US population, including non-registered citizens and non-voters in general. And since we are including their social networks, the ties themselves would have to be simulated. As a general estimate, let’s say there are 200 ties per node on average; then we would have about 300 million * 200 = 60 billion tie endpoints to simulate (about 30 billion ties if each tie is counted once), which presents enormous computational difficulties. All of that is what you would have to do just to instantiate the agent population or, as it is more commonly known, the synthetic population. Now they have to actually vote in the simulation, which presents a greater challenge. Finally, so that the results are statistically meaningful, the simulation has to be run a few thousand times at minimum.
This can all be done with current tools and technologies. For example, we can use a cloud cluster, and with a little parallel programming, a ton of RAM, and some SSD storage we’d be all set. But even then this would by no means be a quick simulation, nor would it be an inexpensive one.
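A quick back-of-the-envelope check of that scale: 300 million people with 200 ties each works out to 60 billion tie endpoints. The sketch below just does the arithmetic. The 4-bytes-per-id figure is my own assumption (a 32-bit integer is enough to number 300 million agents), and it counts only the raw neighbor lists, with no traits and no overhead.

```python
POPULATION = 300_000_000     # rough US population used in the text
TIES_PER_PERSON = 200        # average ties per node, as estimated in the text
BYTES_PER_ID = 4             # assumption: 32-bit integer agent ids

# Every person stores 200 neighbor ids, so this counts each tie from both ends.
tie_endpoints = POPULATION * TIES_PER_PERSON
# If ties are mutual, each one was counted twice.
undirected_ties = tie_endpoints // 2
# Raw storage for the neighbor lists alone.
adjacency_bytes = tie_endpoints * BYTES_PER_ID

print(f"tie endpoints:   {tie_endpoints:,}")               # 60,000,000,000
print(f"undirected ties: {undirected_ties:,}")             # 30,000,000,000
print(f"neighbor lists:  {adjacency_bytes / 1e9:,.0f} GB")  # 240 GB
```

So the network alone is on the order of hundreds of gigabytes before a single agent has a trait or casts a vote, and that is for one run.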
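One thing that helps with the cost is that the repeated runs are independent of each other, so they can be spread across cores or cluster nodes. Here is a minimal sketch of the replication loop, run serially. The toy `run_once` is a stand-in I made up (every voter picks Candidate A with the same hypothetical probability `p_a`); a real run would step the full agent population instead.

```python
import random
import statistics


def run_once(seed: int, n_voters: int = 10_000, p_a: float = 0.51) -> float:
    """One replication: Candidate A's vote share under a fixed seed."""
    rng = random.Random(seed)  # a private generator keeps each run reproducible
    votes_a = sum(rng.random() < p_a for _ in range(n_voters))
    return votes_a / n_voters


def run_many(n_runs: int = 1_000, **kwargs) -> dict:
    """Repeat the simulation with seeds 0..n_runs-1 and summarize the spread."""
    shares = [run_once(seed, **kwargs) for seed in range(n_runs)]
    return {
        "mean_share": statistics.mean(shares),
        "stdev_share": statistics.stdev(shares),
        "a_win_rate": sum(s > 0.5 for s in shares) / n_runs,
    }


if __name__ == "__main__":
    print(run_many(n_runs=200))
```

Because each run depends only on its seed, the list comprehension in `run_many` could be swapped for a process pool or a batch of cluster jobs without changing the results.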
The Population’s Traits
Another challenge is deciding what kind of traits to give the population. Traits include demographic traits such as age, gender, race, income level, employment level and other important factors that seem to weigh on voter decisions. Census Bureau data can provide that information, but it almost always aggregates it to some level (for a number of reasons, one of them being anonymity). With certain types of data the aggregation is at the state level, with others it’s the district or county level, but almost never smaller groups than that. So, at the end of the day, using the existing data to instantiate the population means that if you look far enough down into the model you’ll find some homogeneity, because heterogeneous data simply isn’t available. Of course we can add artificial heterogeneity to the data, but wouldn’t that just be adding randomness to our data? And in theory, shouldn’t that increase our final error margins?
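Here is a minimal sketch of what instantiating agents from aggregated data looks like. The county name and the shares in `COUNTY_MARGINALS` are hypothetical. Note the built-in assumption: each trait is drawn independently, because county-level totals say nothing about how age and income go together within the county. That is exactly the heterogeneity that gets lost.

```python
import random
from collections import Counter

# Hypothetical county-level marginals; shares for each trait sum to 1.
COUNTY_MARGINALS = {
    "County X": {
        "age": {"18-29": 0.20, "30-64": 0.55, "65+": 0.25},
        "income": {"low": 0.30, "middle": 0.50, "high": 0.20},
    },
}


def synthesize(county: str, n: int, rng: random.Random) -> list:
    """Draw n synthetic residents, sampling each trait from its marginal."""
    marginals = COUNTY_MARGINALS[county]
    people = []
    for _ in range(n):
        person = {
            trait: rng.choices(list(dist), weights=list(dist.values()))[0]
            for trait, dist in marginals.items()
        }
        person["county"] = county
        people.append(person)
    return people


if __name__ == "__main__":
    pop = synthesize("County X", 10_000, random.Random(42))
    print(Counter(p["age"] for p in pop))
```

Every agent in the county is a draw from the same distribution, so below the county level the only differences between neighborhoods are random noise. When cross-tabulated data is available, population synthesizers typically fit joint distributions (for example with iterative proportional fitting) instead of sampling traits independently.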
Ultimately, it’s OK – you can still create a pretty good model with the available data, but it will take some smart adjustments to figure out how to deal with incomplete trait-based data – I guess that’s an ongoing challenge for any modeler anyway.
It’s clear to anyone who deals with modeling and simulation that social networks almost always have something to do with the final result, but it’s not clear how you would treat them in an election simulation. First, you’d have to determine which social network to model, or decide whether to model all of them (see the first challenge above). For example, would you focus on kin networks, assuming that family members are typically highly influential on each other’s votes? Or would you focus on friendship networks? Or would you just model a social media platform like Twitter and claim that information propagation is the only real social network determinant?
Clearly, in this election cycle social media had a ‘uge (I couldn’t resist) role to play, but it’s not clear whether those who already supported the candidates used social media to share their political views, or whether seeing others’ political views influenced them to have similar views (see any Sociology book for an explanation of Reciprocity). It’s probably a combination of both, but that needs to be fleshed out in some rigorous academic studies before we know for sure. It would be useful to base the simulated social networks on a real source of data, but it’s not clear what the best design criteria would be. Deciding that alone is a challenge.
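For a sense of what the "seeing others’ views changes yours" half of that question looks like in code, here is a minimal sketch. Both pieces are assumptions on my part: the network is purely random rather than based on real kin, friendship or Twitter data, and the update rule is simple neighbor averaging (in the spirit of DeGroot-style opinion models), not a validated model of voters. The `lean` value is a hypothetical 0-to-1 score, where 1 means fully leaning toward Candidate A.

```python
import random


def random_network(n: int, k: int, rng: random.Random) -> dict:
    """Give every node at least k mutual ties to randomly chosen others (k < n)."""
    neighbors = {i: set() for i in range(n)}
    for i in range(n):
        while len(neighbors[i]) < k:
            j = rng.randrange(n)
            if j != i:
                neighbors[i].add(j)
                neighbors[j].add(i)   # ties are mutual
    return neighbors


def influence_step(lean: dict, neighbors: dict, weight: float) -> dict:
    """Move each agent's lean toward the average lean of its ties.

    weight=0 means nobody is influenced; weight=1 means agents simply
    adopt their neighbors' average. All agents update at the same time.
    """
    updated = {}
    for i, ties in neighbors.items():
        peer_mean = sum(lean[j] for j in ties) / len(ties)
        updated[i] = (1 - weight) * lean[i] + weight * peer_mean
    return updated


if __name__ == "__main__":
    rng = random.Random(3)
    net = random_network(1_000, 10, rng)
    lean = {i: rng.random() for i in net}
    for _ in range(5):
        lean = influence_step(lean, net, weight=0.2)
    print(min(lean.values()), max(lean.values()))
```

This covers only the influence channel. The other possibility, that people with similar views end up connected in the first place, would have to be built into how the network is generated, which is one more design decision on the pile.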
Look! It’s abundantly clear this is not an easy model to put together. Maybe that’s why the political modeling experts opted for a micro-simulation modeling style in the first place. But sooner or later this kind of model has to be built, because the population has already adapted to other methods. And in a country where having a competitive advantage is not only allowed but expected in the political…well…I’ll use the word Industry, someone will eventually have to take on this challenging model design.
Photo by Mike Licht, NotionsCapital.com