This is a project I worked on and wrote when I was a first year PhD student – roughly around 2015-2016. It was intended as a blog form piece. I’ve left the original text as is – there may be typos or occasional incomplete logic, but surely this will be useful to someone.
Introduction
Network science has grown in popularity in the last few decades, especially with the ubiquity of social media platforms and the internet in general. However, much remains to be explored and discovered in relation to the processes by which networks form, the underlying social forces that underpin their formation, and thus the statistical distributions formed by those social forces.
In this semi-formal blog/paper I will provide exploratory analysis and comparison of two networks—one is generated through an agent-based model utilizing a small number of simple rules/behaviors (a synthetic network) and a second network collected from the Twitter API using seed accounts for a terrorist organization used in an earlier paper to study propaganda dissemination (Shaheen, 2015). the state of the art power-law python module was used to compare the degree distributions of both networks and to discover the preference in distribution fitting.
This work builds on four separate but equally important lines of inquiry – the first being that of the computational and mathematical sociologists and anthropologists going as far back as Moreno (1934), Bott (1957), Milgram (1969) Granovetter (1973), Freeman (1978), as well as some of the most recent work by Burt, Borgatti, Johnson and others in which the structural properties of networks were investigated.
The second line of inquiry is that of the social physicists, epitomized by Watts and Strogatz (1998), and Barabasi and Albert (1999) in which random networks played a larger role in the exploration of networks as well as in the simulation of networks dynamically.
The third line of inquiry is that of agent-based modelers such as Epstein and Axtell (1996), Axelrod (1997) and Axtell (2001), where agent-level behaviors or rules were used to grow and reproduce system-level properties and effects—a growing field.
Finally, the fourth line of inquiry is that of the heavy tailed distribution analysis of which various developments have taken place over the last century beginning with Pareto (1896), Zipf (1936, 1949), Gabaix (2003) and more recently Clauset et. al (2009).
These lines of inquiry are relevant to the challenge of simulating networks and understanding their properties in silico, and serve as the foundation for the intersection of fields investigating the local forces of network formation and the system-level properties of those same networks.
Methodology
Collected Network
A Twitter backcloth network (friend/follow) network was collected using the Twitter REST API through the use of seed accounts to study terrorist propaganda (Shaheen, 2015). Seed accounts where known terrorist propaganda accounts belonging to the Islamic State of Iraq and Syria (ISIS). These seed accounts were used to form the basis of a snowball sampling method where seed accounts were used to identify other accounts who are connected to them up to some distance away from the original seed accounts.
On average, seed accounts ranged in number from 15-20 accounts. These accounts were then used to collect up to 200 account followers (indegree) at the first degree from ego, and subsequently, those accounts were used as seeds to collect their first degree connections. The process was repeated for a total of 4 degrees of separation from seed nodes. The result was a directed network sized at roughly 160,000 nodes formed by a single component, and no isolates. The period of collection was less than 12 hours and within the limits of the Twitter API rate of delivery. We assume returned accounts are randomized by the Twitter API.
These accounts were then used to form a network of friends and followers—friends being those who mutually follow each other where followers are those who follow accounts (out degree) but where reciprocity did not exist.
Subsequently, an analysis was conducted on the network with a focus on summary (network level) statistics, features, and properties. Figure 1 shows a visualization of the collected network and the generated network to be discussed in the next section.
Synthetic Network
An agent-based model was implemented using Python 2.X and the NetworkX package. The model used simple rules to generate an undirected network of roughly 80,000 nodes, affording a 1:2 comparison with our collected network. The ruleset was under a time-constraint condition such that new nodes entering the network arrived at a much slower rate than the rate of edge formation of existing nodes in the network. Using this time constraint is based on observations of the quantity of friend/follow activity when compared with the quantity of new accounts being created on Twitter.
The ruleset used to generate the intended network was simple and as follows:
- Enter new nodes into the network proportionately to the size of the existing network and according to some scalar parameter of the network size.
- Connect a random number of new nodes into the network randomly to existing nodes according to a uniform distribution.
- For a random number of nodes and in line with the time constraints placed on network iterations, connect ego node with a friend of a friend if available (triadic closure).
- Connect a random number of nodes to a random number of nodes according to some uniform distribution.
- Connect a random number of nodes to a random number of nodes, only if the target node possesses a higher degree centrality and according to some parametrized uniform distribution.
All probability distributions used in the growing of the generated network were of a uniform nature so as not to provide additional (unsupported) complexity to the model. To adjust the success rate as they are applied to the nodal actions of those distributions, a scalar parameter was used. For example, to bring a new node into the network a randomly generated number is drawn from a uniform distribution, and compared to some exogenous parameter set by the author. If the generated number is comparable (below or above) to the parameter set, a new node is brought into the network or denied access. This process was used for every action taken in the network formation, especially for the creation of new edges on the entry of new nodes, the formation of triads, and the formation of random edges in the network.
Parameters are adjusted to reach the desired effect in terms of the network’s summary statistics, but only those relevant to the entry of new nodes into the network, and the frequency of tie formation. This is not a structural adjustment of the network’s parameters and properties but simply a way to adjust the look and feel of the network to be more in line with expected outcomes. The effects of those parameters can be mostly seen on network properties such as density, and the raw centrality scores of nodes. Effects on the structure, and thus the shape and fitting of the degree distribution (which is our focus for this paper) are negligible.
Distribution Analysis
Finally, once the network is generated an analysis of the distributions of both the collected network and the synthetic network is conducted. The aim was to fit the distribution using Clauset et. al. (2007) seminal paper on power law analysis using a Python power law package developed by Alstott et. al. (2014).
A comparison of heavy tailed fitting models was conducted on both networks. Distributions such as the power law, exponential, power law with an exponential cutoff, and log-normal distributions were considered. In the analysis, isolates, or nodes with a null degree, were not included in the distribution comparison or the general fitting because of the mathematical challenges presented by substituting a null value in the power law distribution definitional equation, thus they were discarded.
Discarding isolates falls on a definitional issue, and while some methods, such as assigning all null degreed nodes a degree of one, can be undertaken, it is not clear what justification can be made one way or another on what social forces would allow for this convenience of mathematical transformation. Thus, to achieve a one-to-one comparison we simply allowed for the cutoff of isolate nodes from our distribution. This was only applicable to the generated network and not the collected network since by the method of data collection utilized (snowball sampling) all returned nodes were by design connected to at least one other node in the network—as previously mentioned—a single component.
Fig 1. Top: The collected Twitter network with 160,000 nodes, no isolates and high level of community clustering. Bottom: a generated network using simple rules with 80,000 nodes, some isolates, and high modularity.
Results
Visually, properties of a typical heavy-tailed distribution were confirmed for both the collected network and synthetic network. This is in-line with a multitude of research into this area. In verifying whether the generated network resembles, in its properties, the social media network collected we also investigated a number of distributions and scored for similarity of shape. Figure 2 shows a comparison of the degree distributions of both out collected and synthetic network.
Fig. 2. A comparison of collected network degree distribution (Top) and the generated network degree distribution (Bottom) on a log-log plot
Using Clauset’s method of identifying and comparing heavy tailed distributions we compared a power law fit of both the synthetic and collected network data for degree distributions and found that a power law fit better represents both the collected data and the generated data using a goodness of fit calculation reliant on the Kolmogorov-Smirnov distance of parametrized distributions, and summarized in a log-likelihood ratio calculation. The p-value for the log-likelihood estimator is also calculated and can be understood as the significance value of the chosen distribution when compared to its counterpart. We summarize the results in Table 1 and Table 2 for the synthetic network network and the collected network respectively.
R (loglikelihood ratio) | p-value | First Distribution | Second Distribution |
-1.398 | 0.162 | Power Law | Log-normal |
0.339 | 0.734 | Power Law | Exponential |
-1.677 | 0.021 | Power Law | Truncated Power Law |
Table 1. A comparison of better fit models of candidate distributions for the generated network.
R is the log-likelihood ratio. Positive values denote that the data is better fitted by the first distribution, while negative values denote that the data is better fitted by the second distribution. The strength of the comparison is accounted for by the size of the R value. The p-value denotes the significance of the fitted model. From these candidate distributions a power law distribution tends to be better fit than the exponential distribution but does not achieve significance in comparison for the generated network. In other words, a strong case cannot be made using this information alone for either distribution.
R (Goodness of fit) | p-value | First Distribution | Second Distribution |
-1.88 | 0.060 | Power Law | Log-normal |
31.03 | 2.127 E(-211) | Power Law | Exponential |
-2.486 | 0.009 | Power Law | Truncated Power Law |
Table 2. A comparison of better fit models of candidate distributions for the collected network.
The power law distribution for the collected data is a good fit model than the exponential distribution and achieves a high significance, still a log-normal or truncated power law distribution seems to provide an even better fit than the power law distribution.
In some scenarios, the complementary cumulative distribution function (CCDF) can provide useful information on the fitting of data sets than the probability distribution function (Clauset, 2007), especially in extreme values of the distribution. Thus, an investigation of the fitting of the model to the CCDF was conducted, and outlined in Figure 3 for the generated network and the collected network respectively.
Fig. 3. A comparison of the fitted probability distribution function and the complementary cumulative distribution function (survival function) for the synthetic network (bottom) and the collected network (top). The fitted model used for the comparison was the power law distribution as it represents a middle-of-the-way acceptable distribution for both network degree distribution comparisons.
A comparison of other candidate distribution plots was also carried out against the CCDF of the original data sets. We found that throughout the degree distribution value-set of the synthetic network, all four candidate distribution seemed to closely match our generated data (Figure 4) with better fits for power law, truncated power law, and lognormal distributions. For the collected network the candidate distributions were a closer match with the exception of the exponential model fit, which diverged from expected fit considerably at higher degree values.
Fig. 4. Candidate distributions are plotted on a log-log scale with the generated network (Top) fitting models matching closely to the generated degree distribution data with slight divergence at high degree values, while the collected network data (Bottom) possessed similar properties the divergence of the exponential distribution was significantly higher at high degree values.
As a test of the similarity of the candidate distributions to both the synthetic and collected data sets, an analysis of the Kolmogorov-Smirnov (KS) distance was conducted. This is important because as we fit the candidate distributions to our data, we allow for the optimization of the fitted models through the optimization of xmin—the minimum value of x from the distribution by which the fitted model would be applicable. We found the optimized xmin of the generated model to be most optimized at 36.0 with an α = 4.884 using a power law fit candidate distribution, while the collected data’s optimized xmin was found to be at 4.0 with an α = 1.961.
We also tested whether that optimized xmin would create greater variance if it was changed to another value when compared to the original data sets. Figure 5 shows a plot of the KS distance for increasing values for xmin with plots of the variance as well as the variance relative to the corresponding values of α.
Fig. 5. For a power law fit the KS distance remains low and negligible for both the synthetic network (Top) and the collected network (Bottom) for almost all values of xmin—the optimized value of x such that the fit is most appropriate. This is the case even though the variance of the comparison σ increases at a non-linear rate.
Discussion
What the results show are important insights into understanding the distributions of both the Twitter network collected from the Twitter API and a synthetic network grown through an agent-based model that attempts to represent the former closely. The most important of which is that power law distribution fittings outperform exponential distribution fittings for both network types for this particular data set and that other heavy tailed distribution models, in turn, outperform those in fitting the data – mainly because of the structure of the networks at either low degree nodes or high degree nodes, or both in some cases. In other words, it is in the tails of the distributions where significant variation from the traditional power law is apparent and thus for better fitting power by distributions that better match tail behavior.
It is also apparent that the generated network was better fitted by any of the candidate distributions across its range more precisely. This can be attributed to two possible explanations: The first is that due to the smaller size of the network,
The first is that due to the smaller size of the network, more idealistic behavior with more diverse fitting possibilities manifested. And, that through the use of the same simple rules but at a larger size, more divergence from the candidate distribution would be more likely to occur. The second explanation was that not enough rules were used to generate the network. Networks are often observed to have multiple social forces acting at any given time. Our aim was to reproduce the network using a small number of simple rules so as to simulate the observed collected network from our data. Thus, it is likely that the generated network, in its current form, still does not capture the essence of our collected network because it does not contain enough complexity. This could be attributed to the better fitting of any ideal heavy tailed distribution for the generated network versus the fitting of only a few candidate distributions for the collected network.
The second explanation was that not enough rules were used to generate the network. Networks are often observed to have multiple social forces acting at any given time. Our aim was to reproduce the network using a small number of simple rules so as to simulate the observed collected network from our data. Thus, it is likely that the generated network, in its current form, still does not capture the essence of our collected network because it does not contain enough complexity and dependence. This could be attributed to the better fitting of any ideal heavy-tailed distribution for the synthetic network versus the fitting of only a few candidate distributions for the collected network.
Additionally, it continues to be unclear by which mechanism the Twitter API returns accounts to the user when requested. And, though we assume that, structurally, the network would continue to be relevant—that is—high centrality nodes would likely still continue to be central, isolates would continue to be isolated etc, if the network’s size approaches a certain critical point, little is known about the relationship between a snowball sampling technique and the full network represented on social media platforms. It is also relevant to consider that the process by which accounts are returned does not fit into the traditional definition of sampling since all network nodes are returned within the core network but their degree distribution is cutoff at a certain path length from source nodes.
Finally, the continued stability of the KS distance at various optimal values of xmin shows that the power law distribution and those with a heavier tail property are quite robust at describing both the generated and collected networks further weakening prevailing assumptions that networks’ degree distributions are exponential in nature.
Conclusions
I showed that a synthetic network can produce degree distributions comparable to that of a collected network from Twitter. The comparability of the degree distributions is established on the fitting model used to explain and describe it. I also showed that power law distributions fit the data-sets for synthetic networks as well as collected networks more closely than an exponential distribution but not as well as a log-normal or truncated power law distribution.
This result was robust under a wide range of optimal minimal values of the degree distribution (xmin), however, issues related to the adjustment of generating rules-set of networks and the collection method of collected twitter data remain to be explored.
Future work should focus on the exploration of sampling methods on social media sites and specifically the Twitter API and on the generating forces and mechanism of growing networks.
References
Clauset, A., Shalizi, C. R., & Newman, M. E. J. (2007). Power-law distributions in empirical data. doi:10.1137/070710111
Alstott, J., Bullmore, E., & Plenz, D. (2014). Powerlaw: A python package for analysis of heavy-tailed distributions. PLoS ONE, 9(1), 1–18. doi:10.1371/journal.pone.0085777
Newman, M. E. J. (2002). Assortative Mixing in Networks. Physical Review Letters, 89(20), 208701. doi:10.1103/PhysRevLett.89.208701
Newman, M. E. J. (2005). Power laws, Pareto distributions and Zipf’s law. Power Laws, Pareto Distributions and Zipf’s Law. Contemporary Physics, 46(5), 323–351. doi:10.1016/j.cities.2012.03.001
Axtell, R. (2000). Why agents?: on the varied motivations for agent computing in the social sciences. Center on Social and Economics Dynamics – The Brookings Institution, (17), 1–23. doi:10.1016/j.cep.2007.02.029
Freeman, L. C. (1978). Centrality in social networks conceptual clarification. Social Networks, 1(3), 215–239. doi:10.1016/0378-8733(78)90021-7
Burt, R. S. (2001). Structural holes versus network closure as social capital. Social Capital: Theory and Research, 1(May 2000), 31–56. doi:Burt_2001
Newman, M. E. J. (2005). Power laws, Pareto distributions and Zipf’s law. Power Laws, Pareto Distributions and Zipf’s Law. Contemporary Physics, 46(5), 323–351. doi:10.1016/j.cities.2012.03.001
Pareto, V. (1964). Cours d’{é}conomie politique. Librairie Droz.
Yule, G. (1925). The growth of population and the factors which control it. Harrison & Sons.
Bott, E. (1957). Family and Social Network: Roles. Norms, and External Relationships in Ordinary Urban Families, London: Tavistock Publications.
Moreno, J. L. (1934). Who shall survive?: A new approach to the problem of human interrelations. Washington, DC, US: Nervous and Mental Disease Publishing Co. doi:10.1037/10648-000
Granovetter, M. (1973). The Strength of Weak Ties. The Strength of Weak Ties. doi:10.1086/225469
Watts, D. J., & Strogatz, S. H. (1998). Collective dynamics of “small-world” networks. Nature, 393(6684), 440–2. doi:10.1038/30918
Barabási, A.-L., & Albert, R. (1999). Emergence of scaling in random networks. Science, 286(5439), 11. doi:10.1126/science.286.5439.509
Gabaix, X. (1999). Zipf’s Law for Cities: An Explanation. Quarterly Journal of Economics, 114(3)(August), 739–767. doi:10.1162/003355399556133