This is a project I worked on and wrote when I was a second year PhD student – roughly around 2016-2017. It was intended as a blog form piece. I’ve left the original text as is – there may be typos or occasional incomplete logic, but surely this will be useful to someone.
To cite this resource:
Shaheen, Joseph A. E., Twitter Bot Classification Using Bayesian Machine Learning, https://doi.org/10.6084/m9.figshare.14448126, accessed from https://web.archive.org/web/20210419175444/https://www.josephshaheen.com/portfolio/project-twitter-bot-classification-using-bayesian-machine-learning
The prevalence of bots on social media platforms (like the famed Twitter bot) has become an issue of great importance with national security implications. ISIS, hackers, and nationalist (white and others) extremist groups use social bots to spread propaganda and hate messages while inflating their perceived support and presence. Social media platforms need more accurate methods of identifying these bots and must ensure that this is done with a high level of precision lest they alienate their real user base. This short project embarks on an exploration of this challenge using a simple Bayesian probability/classification model (Naïve Bayes), but I’ll also report a non-naive model which I don’t test for precision of prediction – why not? It’s just a blog!
With the recent interference in the US Presidential Election by foreign interests (or the attempt), and the growth of extremism on the social web by both domestic and foreign extremist groups, social media platforms such as Twitter and Facebook have been struggling with the challenge of identifying automation methods used by various groups (Check Trefis, 2017) [References for all citations are at the bottom].
Social media automation, or bot trolling as it is more commonly known, is often used to propagate false information or conduct INFO-OPS/PSYOPS by state and non-state actors. This tactic delivers multiple advantages: for instance, it provides plausible deniability for groups propagating hateful ideologies or conducting illegal activities, or perhaps attempting to conceal true intentions by using anonymous or difficult-to-trace accounts. On platforms such as Twitter, the creation of bot accounts is made especially convenient through a functional API that allows account actions to be controlled remotely, and to top it off you can do it with code (scripts).
Recently, bot utilization has been used by extremist groups such as ISIS/ISIL/DAESH for the recruitment of operatives all over the world with great success (see Shaheen, 2015). In the case of ISIS, bots were used in the face of aggressive suspension and deletion by Twitter on behalf of various governmental organizations, while some were allowed to continue operating—presumably for intelligence collection purposes. However, in general, since 2015, while ISIS was reaching the height of its activity on social media, delivering roughly half a million (Shaheen, 2015) propaganda messages per month, a large scale (likely classified) effort was undertaken to identify bot and automated accounts.
Side note: And for years now the Pentagon and other agencies have been throwing money hand over fist to anyone that can provide a sustainable solution. This is not an easy problem – trust me.
Bot utilization for propaganda dissemination by groups like ISIS is an important tactic in extremist groups’ arsenal of disinformation strategies – by inflating a group’s presence, they’re able to provide a level of social proof for fringe ideologies and enhance the perception that the group and its ideology are actually part of the mainstream. And, though the idea of using automated methods to enhance social influence on the web is not new (it goes all the way back to the Ancient Internet, e.g. IRC and ICQ), in a climate where a large proportion of internet users receive information through social media platforms the tactic becomes effective. Consequently, the effort to identify bot and automated accounts becomes ever more important.
Ferrara et al. (2014) discuss and outline some of those efforts and the rise of social bots in general. The authors list previous attempts to identify bots and sybils (multiple accounts controlled by a single user) using social graphs (Cao et al., 2012), crowd-sourcing of social bot detection (Wang et al., 2013b), the Bot or Not project (Lee et al., 2011), which uses an off-the-shelf supervised learning algorithm to classify bots based on features, and mixed mode systems (Alvisi et al., 2013). With the exception of the Bot or Not project, which in 2011 achieved 95% accuracy in identifying bots, most attempts resulted in a high number of false positives. This makes those algorithms particularly unsuitable for the task at hand, as service providers like Twitter and Facebook would likely rather let a bot continue sharing false information than suspend a real user’s account. This is perhaps the central challenge of the bot identification paradigm.
And, while Wang (2010) and Tavares & Faisal (2013) use machine learning methods of detection and time-factors-based methods of classification respectively, along with the aforementioned methods, it is not clear that a proven and standard method in the open source space has been successfully proposed. Maybe there are some super potent private sector solutions to which I am not privy! Who knows!?
Bayesian – and generally – graphical probability models are especially amenable to this challenge. All that was reason enough for me to want to test the difficulty of this challenge using off-the-shelf Bayesian machine learning methods and report my findings.
I used R to collect the complete profile of my Twitter account (@JosephShaheen) and ran an exploratory analysis of the data set – just to get a sense of it. This included the identification of features that can be used in a classification algorithm.
I identified 17 features (with the id duplicated) of a user account’s followers that can potentially be used for classification. Those features are listed in Table 1.
|Feature|Description|
|Account ID|A numeric identifier of each account|
|Status Count|The number of status posts an account has made|
|Followers|The number of followers that follow said account|
|Favorites|The number of statuses the account has favorited|
|Friends|The number of accounts the user account follows|
|Url|The url listed on the user’s profile|
|Name|The name listed on the account|
|Creation Date|The date the account was created|
|Protected Status|Whether the account’s profile is private|
|Verified Status|Whether the account has been verified to be real by Twitter|
|Screen Name|The screen name of the account|
|Location|Location listed by the account holder|
|Listed Count|The count of how many times an account has been listed|
|Follow Requested|Whether the user has requested to follow the account|
|Profile Image url|The url of the account’s profile image|
Table 1: 17 features, only 14 of which were ultimately used in the analysis
The method utilized for this challenge works under the assumption that given the qualitative (categorical) and quantitative (continuous) features of the target collection of accounts, we can identify a directed graphical probability model (a Bayesian network) that could model our collected feature data and consequently allow for inference.
Using a number of open source libraries such as twitteR, I collected the data. I used the caret and e1071 libraries for some analysis because they’re popular machine learning packages with a native Naive Bayes implementation, but most importantly I used the bnlearn library – a state of the art Bayesian network development package with numerous functions and classes. I developed and tested several models. I report on two of those models.
A brief discussion of the process of data cleanup and formatting is appropriate at this stage. I primarily focused my analysis on features of accounts that could be interpreted by the native functions of the aforementioned packages. That meant that although there is likely some relationship between the profile image of an account and their account’s status as a bot, I did not employ image recognition algorithms to conduct an analysis of the images. We already know that most bots are what we call “eggs” on Twitter. We call them eggs because the standard default image when you don’t post a profile picture looks like an egg (go see for yourself).
The account’s creation date was also not used, as it presented difficulties in translation to a numeric form which wasn’t easily interpreted once the data was discretized. This goes to our lack of knowledge of the distribution of account creation dates on the Twitter platform. In layman’s terms, I don’t know if more or fewer bots were created this year or last year so I can’t provide a good prior estimate for it from data.
Thus, those features were discarded.
All numeric features, such as number of followers, and the number of statuses posted by accounts were discretized into an arbitrary number of categories chosen to ensure that there would be sufficient resolution in the output categories to be meaningful. For example, the lowest number of status messages in our training data for all accounts was 0, and the highest number was roughly 1.6 million. This scale was then divided into 10 intervals. The intervals were automatically calculated using bnlearn’s discretization function and the discretized dataset was used across all output models and packages. As it turns out, higher numbers of category selections were not possible using the bnlearn package, but it was not clear whether this was an artifact of our code or of the library itself. Right here was the first challenge – using a fixed and linear discretization like that is lazy and in bad taste. As a complexity scholar – I really should’ve divided up the tweet distribution into power law or exponential bins – considering that it is plainly obvious to any complexity researcher that the tweet distribution on the platform will be highly skewed. Oh well!
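To make the binning problem concrete, here is a minimal sketch in Python (not the R code used for the project) that compares ten equal-width intervals with ten logarithmically spaced ones. The status counts are made-up illustrative values, and `equal_width_bin` and `log_bin` are hypothetical helper names; this is not how bnlearn implements its discretization.

```python
import math

def equal_width_bin(value, low, high, n_bins=10):
    """Index (0-based) of the equal-width interval that contains value."""
    width = (high - low) / n_bins
    return min(int((value - low) / width), n_bins - 1)

def log_bin(value, high, n_bins=10):
    """Index of a logarithmically spaced interval; log1p keeps a count of 0 valid."""
    scaled = math.log1p(value) / math.log1p(high)
    return min(int(scaled * n_bins), n_bins - 1)

# Hypothetical status counts, skewed the way tweet counts tend to be.
status_counts = [0, 3, 12, 40, 150, 800, 2_500, 9_000, 60_000, 1_600_000]
high = max(status_counts)

linear = [equal_width_bin(v, 0, high) for v in status_counts]
logged = [log_bin(v, high) for v in status_counts]

print("equal-width bins:", linear)
print("log-spaced bins: ", logged)
```

With values like these, equal-width binning pushes almost every account into the lowest interval, while the log-spaced version spreads the same accounts across most of the ten categories.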
I considered developing a model which was entirely continuous or mixed instead of entirely discrete. Continuous models would’ve required that categorical variables be discarded, potentially at the expense of inference and explanatory power, and for our data set there may have been challenges in specifying distributions in closed form without additional study of the data. And as reported in the literature, the Boettcher and Dethlefsen (2003) mixed variable model does not allow a node associated with a continuous variable to be the parent of a node associated with a discrete variable, and therefore it would have limited our ability to develop a non-naive model.
The training data was gathered from two samples. The first was all followers of the author’s Twitter account (roughly 2500 followers). As is typical with a naïve Bayes learning model, I took a supervised learning approach by manually identifying bots in the data set using a simple Turing Test (if it looks, acts, and tweets like a bot then it’s a bot). Unfortunately, this method yielded only two accounts that were identified as bots with some certainty from my followers (I’m insulted that no one thinks I’m important enough to have a bunch of bots troll me. Haha!) Thus, the ability to ‘teach’ any probability model, Bayesian or otherwise, would have been limited.
Long story short – I needed the feature data set of bots that I knew for certain to be bots – in order to train my model. Consequently, I identified a ‘black hat’ method which gave me what I needed (Sshhh! Don’t tell anyone!). I ended up with the feature data set of some 4000 beautiful, totally real bots – and I could’ve asked for more! My sample was made up entirely of bots with an almost absolute certainty and thus the probability of developing the right model would be maximized. We’ll skip the details on how I know that and how I gathered the bots. 😉
In order to ensure that my model is learned on both bot and non-bot data, the two samples were combined to include portions from my account’s follow group and the others from the bot group.
Herein lies the second major challenge: for me to guess a prior I need to know how many bots – total – are on Twitter. Think Bayes’ theorem, where to get P(Bot | Feature) [the probability that the account is a bot given the feature] I still need P(Bot). I can guess that, of course – but the moment I combined the two data sets, which at this point were no longer random samples, I would’ve had to combine them in the right proportion, reflecting the ratio at which bots are found on Twitter. Hopefully that’s clear!
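A small Python sketch shows how much the prior matters. The two likelihoods are derived from the url_existence probabilities reported later in the post (0.04 for bots, 0.71 for non-bots); the two priors are illustrative assumptions, not measured values, and `posterior_bot` is a hypothetical helper name.

```python
def posterior_bot(prior_bot, lik_bot, lik_human):
    """P(Bot | Feature) from Bayes' theorem for a two-class problem."""
    evidence = lik_bot * prior_bot + lik_human * (1 - prior_bot)
    return lik_bot * prior_bot / evidence

# Feature observed: the profile lists no external url.
# P(no url | bot) = 1 - 0.04 and P(no url | not bot) = 1 - 0.71.
lik_bot, lik_human = 0.96, 0.29

# Assumed priors: a bot-heavy training mix versus a much lower platform-wide rate.
training_mix_prior = 0.60
platform_prior = 0.15

high_prior_posterior = posterior_bot(training_mix_prior, lik_bot, lik_human)
low_prior_posterior = posterior_bot(platform_prior, lik_bot, lik_human)

print(round(high_prior_posterior, 3), round(low_prior_posterior, 3))
```

The same evidence gives a very different answer depending on the share of bots assumed up front, which is why the mixing proportion of the two samples matters.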
I then trained a naïve Bayesian classification model on the discretized data using the e1071 and caret packages with an assumption of independence of all the data’s feature nodes. To test our resulting model I gathered the follow list of the George Mason University School of Engineering Twitter account and used that dataset as the prediction/test model data.
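For readers who want to see what training a naive Bayes classifier on discretized features involves, here is a from-scratch sketch in Python. It is not the e1071 or caret implementation; the feature names, the toy rows, and the Laplace smoothing value `alpha` are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(rows, labels, alpha=1.0):
    """Count class frequencies and per-class feature value frequencies."""
    class_counts = Counter(labels)
    feature_counts = defaultdict(Counter)  # (feature, label) -> value counts
    levels = defaultdict(set)              # feature -> set of observed values
    for row, label in zip(rows, labels):
        for feature, value in row.items():
            feature_counts[(feature, label)][value] += 1
            levels[feature].add(value)
    return class_counts, feature_counts, levels, alpha

def predict_proba(model, row):
    """Posterior over classes, treating features as independent given the class."""
    class_counts, feature_counts, levels, alpha = model
    total = sum(class_counts.values())
    log_scores = {}
    for label, count in class_counts.items():
        score = math.log(count / total)  # class prior taken from the training mix
        for feature, value in row.items():
            seen = feature_counts[(feature, label)][value]
            score += math.log((seen + alpha) / (count + alpha * len(levels[feature])))
        log_scores[label] = score
    top = max(log_scores.values())
    unnormalized = {k: math.exp(v - top) for k, v in log_scores.items()}
    z = sum(unnormalized.values())
    return {k: v / z for k, v in unnormalized.items()}

# Hypothetical, already-discretized training rows.
rows = [
    {"has_url": "no", "bio_bin": "short"},
    {"has_url": "no", "bio_bin": "short"},
    {"has_url": "no", "bio_bin": "long"},
    {"has_url": "yes", "bio_bin": "long"},
    {"has_url": "yes", "bio_bin": "short"},
    {"has_url": "yes", "bio_bin": "long"},
]
labels = ["bot", "bot", "bot", "human", "human", "human"]

model = train_naive_bayes(rows, labels)
probs = predict_proba(model, {"has_url": "no", "bio_bin": "short"})
print(probs)
```

Note that the class prior here comes straight from the label counts, so the proportion in which the two samples are mixed feeds directly into every prediction.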
I asked my predictive model to predict which of the Engineering School’s followers were bots, and then navigated to each account and determined whether the prediction was accurate the old-fashioned way – a Turing test.
Finally, I used the bnlearn package to develop a full Bayesian non-naïve model and also used that model for predictive purposes.
The naïve Bayes model is fully described below including all relevant distributions, but we will discuss some of the node probability distributions to make sure we can agree with the model using common sense. As I previously mentioned, we allowed the description length to be a numerical variable representing the number of characters in the profile description and later discretized that integer variable. It was interesting then to find that, in the case where the observation was not a bot, the node probabilities across the 10 discretized categories appeared roughly normal with a right skew and a median somewhere in the 133-177 character range. For bots, by contrast, there was a high probability (0.92) of the length of the bio/profile description falling in the smallest category of 0-44 characters. This should make sense to you because it should lead you to the following: low quality bots are created with a goal of optimizing operations so a creative and long description would not have been optimal for the bots’ creators. They got other ***t to do! Good for them. That seems to make sense to me.
What didn’t make sense in the NB classifier was that I found very similar distributions of status messages when considering both the bot and non-bot cases ($Status_table), both having roughly the same probability (0.99) of being in the lowest category, indicating that perhaps status message count is not a credible classifier for our data. This is super important – the kinds of bots I was able to get were low quality and mostly eggs – inactive ones. The Russian and ISIS bots are highly active though – so my naive classifier is only as good as my training data.
Another interesting result of this classifier was the apparent contrast between the state probabilities of the url_existence node, which is a categorical variable indicating whether the account user has listed an external url. For non-bots the probability of listing an external url on their profile was 0.71, while for bots it was 0.04, suggesting that bot accounts typically did not include an external url, perhaps because it would require additional effort to create an external website that could provide additional social proof for the identity of the fake account.
The results of the verified status feature were super interesting. Verified status is the status of an account which has been manually verified by Twitter. These verifications are usually made manually by Twitter employees for some of the most influential public figures and media personalities, and so a verified account is almost certain to be a real account. However, those accounts are rare to begin with when considering the total number of accounts on Twitter. Perhaps this is why the prior distribution of the node is such that it is 0.999 for when a user is verified but not a bot, versus a difference of 2 orders of magnitude for when they are not verified and not a bot (0.018 versus 0.0009).
Overall, the popularity of the naive Bayes implementation seemed to suggest that it would meet expectations in terms of how well it classified data; nothing seemed to suggest that this method of classification would yield a less than optimal solution. That’s why it was unexpected when, using the follower list from the GMU School of Engineering’s Twitter account to test the model, we received a prediction of 510 bot accounts and 684 non-bot accounts out of the school’s 1194 total followers. This seemed to be on the extreme side in terms of prediction.
Certainly, my Turing test on that data didn’t reveal more than two-dozen accounts that might’ve been bots!
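A quick back-of-the-envelope calculation, sketched in Python, shows how bad that is. It takes "two-dozen" as exactly 24 and makes the generous assumption that every one of those accounts was among the 510 flagged by the model, so the precision figure is a best case.

```python
predicted_bots = 510
predicted_humans = 684
followers = predicted_bots + predicted_humans  # total followers scored by the model

plausible_bots = 24  # "two-dozen", treated as an upper bound from the manual check

# Best case: all plausible bots are among the accounts the model flagged.
max_precision = plausible_bots / predicted_bots
min_false_positives = predicted_bots - plausible_bots

print(followers, round(max_precision, 3), min_false_positives)
```

Even under that generous assumption, fewer than one in twenty flagged accounts would actually be a bot.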
In investigating the causes of a high rate of prediction when compared with our expectation let’s illustrate using an actual observation where there was a false positive [Type 1 error] prediction (Table 2).
Table 2: This table describes the features of one example account
If we take this particular example and we calculate P(Bot = True | Features) according to our model, we find the quantity equal to the product of the node probabilities for that observation (0.99 x 0.98 x 0.99 x 0.99 x 0.99 x 0.96 x 0.437 x 0.95 x 0.95 x 0.43 x 0.907 x 1.0) which is equal to 0.139 or that this particular observation has a 14% chance of being a bot. From the documentation of the naïve Bayes implementation, it is not clear what the confidence interval used for prediction actually is, but node probabilities are assumed to be Gaussian for the native implementation.
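The arithmetic above can be checked with a short Python sketch. One caveat worth flagging: a product of node probabilities on its own is an unnormalized score, and turning it into P(Bot = True | Features) would normally also require the matching product for the non-bot class. The non-bot scores below are hypothetical placeholders, since the post does not report them, and the sketch assumes each score already includes its class prior.

```python
import math

# Node probabilities quoted in the text for the Bot = True case.
bot_terms = [0.99, 0.98, 0.99, 0.99, 0.99, 0.96, 0.437, 0.95, 0.95, 0.43, 0.907, 1.0]
bot_score = math.prod(bot_terms)  # close to the 0.139 reported in the text

def normalized_posterior(bot_score, human_score):
    """Posterior for the bot class once both class scores are known."""
    return bot_score / (bot_score + human_score)

# Hypothetical non-bot scores, only to show how the posterior depends on them.
posterior_if_human_low = normalized_posterior(bot_score, 0.05)
posterior_if_human_high = normalized_posterior(bot_score, 0.50)

print(round(bot_score, 3))
print(round(posterior_if_human_low, 3), round(posterior_if_human_high, 3))
```

Depending on the competing non-bot score, the same 0.139 could correspond to a posterior well above or well below one half, which may help explain why an account with a seemingly low score was still labelled a bot.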
Moreover, the observed features of this user almost all fall in the lower categories for a number of key feature nodes. In fact, in checking the model, our use of the ‘interval’ method to discretize our data meant that, for more than half our features, the majority of our training sample was concentrated in the lowest of the resulting categories. For example, the difference between the top tweeter and the lowest (mentioned earlier) was 1.4 million (the lowest was nil), but the discretization of training data using the ‘interval’ method in bnlearn resulted in 10 equally spaced intervals, and because the majority of users would have a much lower number of tweets, the result was that more than 99% of our sample was included in the lowest interval, while the mid and upper intervals were generally data-sparse. My discretization choices likely made an enormous contribution to the large number of false positives received. This would be corrected in future iterations of the model.
Using the same discretized form, however, could potentially yield additional insights into this classification problem if we allow for a freely developing structure. Therefore, the next step was to use bnlearn’s structure and parameter learning functions to identify a more complex structure for the network as an indicator of where additional improvements can be made to the classifier. Figure 2 shows the result of this effort.
Figure 2: The non-naïve structure learned from data
Interestingly, the majority of features are found not to be direct descendants of the bot/not bot classifier node, suggesting that a better structure for our data can yield improved results even with an imprecise discretization technique such as the one used in the naïve model. In one instance the friend count (the count of how many users the account in question follows) was a great, great, great grandchild of the bot node (3 hops removed), suggesting that there are critical nodes that are far removed from each other, which weakens the independence assumptions that must be made to justify a naïve Bayes model. The structure learning algorithm used a score-based greedy hill-climbing search method (Scutari, 2010).
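To give a feel for what score-based greedy hill climbing does, here is a simplified Python sketch. It is not bnlearn's implementation: it scores candidate graphs with a BIC-style score, only tries adding or removing single arcs (arc reversal is left out), and runs on a synthetic data set whose variable names are made up for illustration.

```python
import math
from collections import Counter
from itertools import permutations

def family_bic(data, child, parents):
    """BIC-style score of one node given its parent set (discrete data)."""
    n = len(data)
    joint = Counter((tuple(r[p] for p in parents), r[child]) for r in data)
    parent_counts = Counter(tuple(r[p] for p in parents) for r in data)
    loglik = sum(c * math.log(c / parent_counts[pa]) for (pa, _), c in joint.items())
    n_params = (len({r[child] for r in data}) - 1) * len(parent_counts)
    return loglik - 0.5 * math.log(n) * n_params

def creates_cycle(parents, src, dst):
    """True if adding src -> dst would close a cycle (dst is an ancestor of src)."""
    stack, seen = [src], set()
    while stack:
        node = stack.pop()
        if node == dst:
            return True
        if node not in seen:
            seen.add(node)
            stack.extend(parents[node])
    return False

def hill_climb(data, variables):
    """Greedy search: apply the single best arc addition or removal until none helps."""
    parents = {v: set() for v in variables}
    score = {v: family_bic(data, v, []) for v in variables}
    while True:
        best_gain, best_move = 1e-9, None
        for src, dst in permutations(variables, 2):
            if src in parents[dst]:
                candidate = parents[dst] - {src}   # try removing the arc
            elif not creates_cycle(parents, src, dst):
                candidate = parents[dst] | {src}   # try adding the arc
            else:
                continue
            gain = family_bic(data, dst, sorted(candidate)) - score[dst]
            if gain > best_gain:
                best_gain, best_move = gain, (dst, candidate)
        if best_move is None:
            return parents
        dst, candidate = best_move
        parents[dst] = candidate
        score[dst] = family_bic(data, dst, sorted(candidate))

# Synthetic data: no_url mostly tracks bot, noise is unrelated to both.
data = [
    {"bot": i % 2, "no_url": (i % 2) if i % 10 else 1 - (i % 2), "noise": (i // 2) % 2}
    for i in range(200)
]
learned = hill_climb(data, ["bot", "no_url", "noise"])
print(learned)
```

On this toy data the search links the two related variables and leaves the unrelated one isolated, which is the kind of structure a naive Bayes model, with every feature attached directly to the class node, cannot express.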
Discovering this model will be left to future iterations of the model, but it presents interesting opportunities.
Classification problems are not easy problems, and as was shown in this project blog it is often not enough to run a standard algorithm blindly without making important considerations about the data available, the method of input, the depth of classification and other important issues – perhaps this is the most important lesson learned here.
However, there are significant opportunities in the space of bot classification. As mentioned in the introduction, there exists, to date, only one bot classifier with a public interface (the Bot or Not project), and this particular service uses data gathered in 2011—likely bot creation methods have become more sophisticated since then and the pervasiveness of bots has become much more significant.
The models I shared here use off-the-shelf methods, small data sets, and are based on a less than complete understanding of bot dynamics on social media. I know!! They leave much to be desired. The motivation, whether to combat extremism, hate groups, or to enhance trust in news and online information has never been greater.
From a more technical perspective, a naïve classifier with adequate discretization and modeling of node distributions can perform with reasonable effectiveness, though we were not able to show that here. You just need a pretty large dataset and the right slicing of our data, really.
However, the structural limitations imposed by a naïve Bayesian network will likely impede a more representative model as shown by the second model presented. There is also the question of including features that take into account images, tweet content, and social network structure that can provide a wealth of information to instantiate the model—all of these serve as potential improvements to the model in the next cycle of development.
- Bottcher, S. G. (2003). Learning Bayesian Networks with R. Proceedings of the 3rd International Workshop on Distributed Statistical Computing, (Vienna, Austria).
- de Jonge, E., & van der Loo, M. (2013). An introduction to data cleaning with R. Statistics Netherlands, 53. http://doi.org/60083 201313- X-10-13
- Dethlefsen, C., & Højsgaard, S. (2005). A common platform for graphical models in R: The gRbase package. Journal of Statistical Software, 14(17), 1–12. http://doi.org/10.1002/dev.20059
- Ferrara, E., Varol, O., Davis, C., Menczer, F., & Flammini, A. (2014). The Rise of Social Bots. arXiv Preprint arXiv:1407.5225, (grant 220020274), 1–11. http://doi.org/10.1145/2818717
- Guntuku, S., Narang, P., & Hota, C. (2013). Real-time Peer-to-Peer Botnet Detection Framework based on Bayesian Regularized Neural Network. arXiv Preprint arXiv:1307.7464. Retrieved from http://arxiv.org/abs/1307.7464
- Højsgaard, S. (2012). Graphical Independence Networks with the gRain Package for R. Journal of Statistical Software, 46(10), 1–26. http://doi.org/10.18637/jss.v046.i10
- Kuhn, M., with contributions from Weston, S., Keefer, C., Engelhardt, A., Cooper, T., Candan, C. (2016). Package “caret.” http://doi.org/10.1126/science.1127647
- Scutari, M. (2010). Learning Bayesian Networks with the bnlearn R Package. Journal of Statistical Software, 35(3), 1–22. http://doi.org/10.18637/jss.v035.i03
- Tavares, G., & Faisal, A. (2013). Scaling-Laws of Human Broadcast Communication Enable Distinction between Human, Corporate and Robot Twitter Users. PLoS ONE, 8(7). http://doi.org/10.1371/journal.pone.0065774
- Team, T. (2017). News That 48 Million of Twitter’s Users May Be Bots Could Impact Its Valuation. Retrieved January 5, 2017, from https://www.forbes.com/sites/greatspeculations/2017/03/22/news-that-48-million-of-twitters-users-may-be-bots-could-impact-its-valuation/#43aef9e36086
- Villamarín-Salomón, R., & Brustoloni, J. C. (2009). Bayesian bot detection based on DNS traffic similarity. Proceedings of The Acm Symposium On Applied Computing, (1), 2035–2041. http://doi.org/10.1145/1529282.1529734
- Wang, A. (2010). Detecting spam bots in online social networking sites: a machine learning approach. Data and Applications Security and Privacy XXIV, 335–342. http://doi.org/10.1007/978-3-642-13739-6_25
- How to Spot a Social Bot on Twitter – MIT Technology Review. (n.d.). Retrieved March 21, 2017, from https://www.technologyreview.com/s/529461/how-to-spot-a-social-bot-on-twitter/
 It should be noted that, as social bots (and their creators) tend to be highly adaptable, this level of accuracy could not be maintained with a more updated data set collected in 2017.
 Black hat is an internet term associated with methods that typically violate a website’s terms of service.