This is a project I worked on and wrote when I was a first year PhD student – roughly around 2015-2016. It was intended as a blog form piece. I’ve left the original text as is – there may be typos or occasional incomplete logic, but surely this will be useful to someone.
Introduction
The goal of this project was to provide a role-based social network analysis of corporate personnel and line management for a large government contractor in the Washington, DC area. I took on this project in early 2013. Social network analysis, or as it is known in the SNA consulting world (those are the guys that run summary measures and call it analysis) Organizational Network Analysis (ONA), can do some interesting things even when deep analysis is taken out of the equation, which is what happens at most firms (especially this one) because they do not want to fund it.
You’ll note that the analysis was conducted on a single network, and that it wasn’t the best type for really getting a good understanding of the organization. Access to more data was not possible under the circumstances, so the analysis had to be performed on the reporting network (what consultants call the formal network).
This leaves out a lot of cool things that can be done with either meta-networks or bipartite (even n-partite) networks. It also leaves out the potential to look at multiplex networks (more than one type of edge), which would be the real benefit of doing SNA on organizations. For example, the reporting structure of the organization represents the network of authority.
If we had the network of friendships, then we could use a method like the multiple regression quadratic assignment procedure (MRQAP), or even fit an exponential random graph model (ERGM), to measure the non-similarity of the 2 or n networks when superimposed. We can still use an ERGM, but we’d be looking at structure only, and not social effects – BORING. Unfortunately, we can’t do anything super cool with this data because it’s missing all of that – even most attribute data is missing.
So you can see my frustration when I was asked to do this project in 2013 (4 years after I started doing SNA) as I thought I had received this potentially valuable opportunity, but no good data to go along with it. So, I made the best out of it. My goals were modest:
- Provide an easy-to-understand visual representation of reporting structure
- Identify asymmetry in reporting structure
- Measure the potential for excess hierarchy and explain consequences of inefficient structure
- Identify clusters and communities, and provide insights into their development
- Identify potential network positions where role overlap may occur
The Data
I received a file with the reporting structure export from the HRIS (Human Resource Information System). I initially received over 20,000 reporting relationships, but there was a high number of duplicates. The analysis was to be conducted on what government contractors call the “Indirect” employees, meaning those employees that are not being billed to a contract – they’re HQ/Corporate Office types. Now, there were only about 2000 of those, and my expectation for a reporting network is that each node has a maximum of 1 outbound edge, or in other words that every node would have a maximal out-degree of 1.
Put differently, it would look like a cellular network with a scale-free degree distribution, superimposed on a core-periphery topology.
After data formatting and cleanup I ended up with roughly 2000 edges as expected. A graph was then built from the data and summary measures were run. Additionally, for specific cases, we went to a deeper level analysis and added context. Since context would be considered private, I won’t be including that analysis here.
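The cleanup step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the actual HRIS export format is not shown in this post, so the (employee, manager) pair layout, the IDs, and the function names `clean_reporting_edges` and `out_degree_violations` are all hypothetical.

```python
from collections import Counter


def clean_reporting_edges(rows):
    """Deduplicate (employee, manager) pairs, keeping first-seen order."""
    seen = set()
    edges = []
    for employee, manager in rows:
        pair = (employee.strip(), manager.strip())
        if pair in seen:
            continue  # drop exact duplicate reporting relationships
        seen.add(pair)
        edges.append(pair)
    return edges


def out_degree_violations(edges):
    """Return employees who report to more than one manager (out-degree > 1)."""
    out_degree = Counter(employee for employee, _ in edges)
    return sorted(e for e, d in out_degree.items() if d > 1)


# Hypothetical toy export; these IDs are made up for illustration.
raw_rows = [
    ("e2", "e1"), ("e3", "e1"), ("e2", "e1"),  # the third row is a duplicate
    ("e4", "e2"), ("e5", "e2"), ("e5", "e3"),  # e5 reports to two managers
]
edges = clean_reporting_edges(raw_rows)
print(len(edges), out_degree_violations(edges))
```

Anything returned by `out_degree_violations` breaks the "at most one outbound edge" expectation and would need a manual look before building the graph.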
Summary Measures
Let’s start with some summary measures: the average degree is 1 and the distribution has a long tail. I’m not sure whether it would fit a power law or an exponential better, and for this project it doesn’t really matter. That’s an expected emergence. The in-degree distribution is the total degree distribution shifted left by 1, since in this case we get in-degree by subtracting the out-degree of 1 from total degree. And, since every node has an out-degree of (exactly) 1, the out-degree distribution is constant (a single spike at 1).
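To make the relationship between the three degree distributions concrete, here is a small sketch on a made-up six-person reporting tree. The edge direction (employee, manager) and the function name `degree_distributions` are assumptions for illustration; the numbers in this post came from Gephi, not from this code.

```python
from collections import Counter


def degree_distributions(edges):
    """Return (in, out, total) degree histograms as {degree: node_count} dicts."""
    nodes = {node for edge in edges for node in edge}
    out_deg = Counter(employee for employee, _ in edges)
    in_deg = Counter(manager for _, manager in edges)

    def histogram(degree_of):
        return dict(sorted(Counter(degree_of(n) for n in nodes).items()))

    return (
        histogram(lambda n: in_deg[n]),
        histogram(lambda n: out_deg[n]),
        histogram(lambda n: in_deg[n] + out_deg[n]),
    )


# Hypothetical toy tree: "a" is the top node, (employee, manager) pairs.
toy_edges = [("b", "a"), ("c", "a"), ("d", "b"), ("e", "b"), ("f", "c")]
in_dist, out_dist, total_dist = degree_distributions(toy_edges)
print(in_dist)     # in-degree histogram
print(out_dist)    # out-degree histogram
print(total_dist)  # total-degree histogram
```

In this toy tree every node except the top one has total degree equal to in-degree plus 1, which is the one-step shift between the two histograms; the top node is the lone exception because it reports to no one here.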
All the weighted measures make no sense here since our network is binary/dichotomous.
The eccentricity distribution is not terribly interesting – it’s a bell curve centered at 5 with 5 levels to one side and 6 levels to another, giving us an 11-hop pathway from one side of the corporate network to another. This could be seen as interesting but needs another piece of information to be usable. The eccentricity is suggesting a high level of hierarchy – in fact, the diameter (the maximum eccentricity of all nodes) was 10. Now this is important to understand because from an organizational perspective it means that information could potentially take a very long time to get from one side of the corporate network to the other.
Now, from an information theory perspective that is probably not the most efficient corporate structure for innovation, but government contractors are well-known to have a hierarchical structure that’s almost always based on military thinking. So in our case that actually makes sense.
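For readers who want to see what eccentricity and diameter actually compute, the sketch below runs a breadth-first search from every node of a small made-up tree. It assumes the reporting edges are treated as undirected and that the graph is connected; the function name `eccentricities` and the toy data are hypothetical, and the figures quoted in this post came from Gephi.

```python
from collections import defaultdict, deque


def eccentricities(edges):
    """Eccentricity of each node: hops to the farthest node (undirected, connected)."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    ecc = {}
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:  # plain breadth-first search from `source`
            node = queue.popleft()
            for neighbour in adj[node]:
                if neighbour not in dist:
                    dist[neighbour] = dist[node] + 1
                    queue.append(neighbour)
        ecc[source] = max(dist.values())
    return ecc


# Hypothetical toy tree with "a" at the top.
toy_edges = [("b", "a"), ("c", "a"), ("d", "b"), ("e", "b"), ("f", "c")]
ecc = eccentricities(toy_edges)
diameter = max(ecc.values())  # longest shortest path in the network
radius = min(ecc.values())    # eccentricity of the most central node
print(ecc, diameter, radius)
```

The diameter is simply the largest eccentricity, so a tall hierarchy shows up directly as a large diameter.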
Closeness centrality also takes on a kind of bell curve shape, except it looks a lot more log-normal than anything else. Again, with a corporate structure where you expect a tree-like network, that makes a lot of sense: the closeness (or farness) of any node i should be about half for the majority of the network, since tree-like structures tend to have some symmetrical properties as well.
Really, the same applies for betweenness, except betweenness takes on its usual form: a heavy-tailed distribution skewed right, with heavy inequality in how often the top nodes (in our case that would be the CEO) sit along the shortest path between any two nodes. From my experience, for small and medium-size networks you can usually make a pretty good assumption that the betweenness distribution will follow a power law, making things like calculating an average or a variance theoretically unfounded.
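As a rough illustration of what betweenness counts, here is a compact sketch of Brandes' algorithm for an unweighted, undirected graph, run on a made-up tree. This is not how the scores in this post were produced (those came from Gephi); the function name `betweenness` and the toy edges are assumptions.

```python
from collections import defaultdict, deque


def betweenness(edges):
    """Unnormalized betweenness via Brandes' algorithm (unweighted, undirected)."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    score = dict.fromkeys(adj, 0.0)
    for s in adj:
        order = []                      # nodes in non-decreasing distance from s
        preds = {v: [] for v in adj}    # predecessors on shortest paths
        sigma = dict.fromkeys(adj, 0)   # number of shortest paths from s
        sigma[s] = 1
        dist = {s: 0}
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = dict.fromkeys(adj, 0.0)
        for w in reversed(order):       # accumulate dependencies, farthest first
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    # Each undirected pair was counted from both endpoints, so halve the totals.
    return {v: b / 2 for v, b in score.items()}


# Hypothetical toy tree with "a" at the top.
toy_edges = [("b", "a"), ("c", "a"), ("d", "b"), ("e", "b"), ("f", "c")]
bc = betweenness(toy_edges)
print(bc)
```

Even in this tiny tree the leaves score zero and the highest score goes to a mid-level manager rather than the top node, which is the kind of inequality the distribution reflects.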
When I began looking at other types of measures such as the graph density, eigenvector centrality, clustering coefficient, PageRank – all of them really told me things that I would’ve already expected. Let me summarize: power is highly concentrated among those with a core position in the network, a high number of direct reports and a high number of indirect reports.
So basically, if you are at a higher part of the tree structure of the network you get assigned a higher score for basically the majority of the summary measures and the node level measures. This is completely expected because the network structure itself is not very interesting. In fact, a corporate reporting structure at a government contractor can really be viewed as somewhat of an idealized network topology.
In other words, we should give the reporting structure of a government contractor an actual network name, the way we have names like scale-free or Erdős-Rényi networks, because it is such a beautiful idealized structure that meets almost all of our expectations and assumptions about how it should look.
I even looked into running modularity algorithms on the network to see if it would divide up into the teams that were all reporting to the same manager, and again I got a result which I completely expected. The reporting structure had self-organized into siloed groups with very little in common. I ran the Blondel et al. (2008) algorithm (the Louvain method) with the Lambiotte resolution set to 1 and Gephi spit out a modularity of about 0.953, which is tremendously high (modularity ranges from -0.5 to 1, with 1 being the highest).
The algorithm divided up the network into about 51 communities, but those were not necessarily distinct, so the number of communities in this particular case is mostly arbitrary and should be taken only to mean that the network is highly fractured. Again, that was something that I completely expected, knowing full well that this organization has achieved its growth over time through acquisition. Add in a preference for a military hierarchy and a culture predisposed to a type of government bureaucracy, and honestly what you get is not something that’s entirely outside of the norm for a government services contractor.
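To show what a modularity score measures, the sketch below computes standard Newman-Girvan modularity (equivalent to a resolution of 1) for a given partition of an undirected graph. It does not implement the Louvain search itself, only the score for a partition you hand it; the function name `modularity` and the toy data (two small teams joined by one reporting line) are made up for illustration.

```python
from collections import defaultdict


def modularity(edges, community):
    """Modularity Q of a partition: sum over communities of
    (internal edges / m) - (community degree sum / 2m) ** 2."""
    m = len(edges)
    degree = defaultdict(int)
    internal = defaultdict(int)
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
        if community[u] == community[v]:
            internal[community[u]] += 1  # edge falls inside one community
    degree_sum = defaultdict(int)
    for node, d in degree.items():
        degree_sum[community[node]] += d
    return sum(
        internal[c] / m - (degree_sum[c] / (2 * m)) ** 2 for c in degree_sum
    )


# Hypothetical toy data: managers "a" and "e" each have three reports,
# and "e" reports to "a".
toy_edges = [
    ("b", "a"), ("c", "a"), ("d", "a"),
    ("f", "e"), ("g", "e"), ("h", "e"),
    ("e", "a"),
]
by_team = {"a": 0, "b": 0, "c": 0, "d": 0, "e": 1, "f": 1, "g": 1, "h": 1}
one_blob = dict.fromkeys(by_team, 0)

q_split = modularity(toy_edges, by_team)
q_single = modularity(toy_edges, one_blob)
print(round(q_split, 4), round(q_single, 4))
```

Putting everything in a single community scores zero, while splitting along the two teams scores higher; a value near 1 would need many communities with almost no edges between them, which is what a heavily siloed reporting tree looks like.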
At this point, it would be easy to take the summary measures, list them out and create some action plan for the organization to reduce hierarchy, but it is not necessarily the case that the hierarchy that exists at the corporate level is not needed. Without the multiplex networks that I spent the introduction discussing, there are some really interesting questions here that can’t be answered without having more data.
Results
Analyzing the network statistically simply doesn’t make sense here. There is just not enough complexity for a statistical model to tell us anything worthwhile that we wouldn’t be able to guess on our own. And, in the case of running a model like an ERGM (exponential random graph model), I really wouldn’t have a clue on how to interpret the coefficients since I have nothing else to compare my model to.
Off-the-shelf data sets, from what I’ve seen, are not easily compared to this skeletal network, and comparison methods like using the Hamming distance or a non-parametric test to compare distributions once again won’t tell me anything interesting.
So with the help of some Tableau and Gephi magic, we’ll stick to the visualizations for now, and perhaps later we can look at simulating the network using the agent approach – maybe for a conference paper.
Structure
Below (first graph) you’ll see the first telling graph with node size being the degree (remember all nodes have an out-degree of 1 so this is really reflective of the in-degree more than the out-degree) and the color (orange=low, blue=high) is the betweenness score. Note the only dark blue node close to the middle center. As it turned out that was the COO of the company. So while the company as a whole had very few high betweenness nodes because of its tree-like structure, the hiring and management process self-organized such that the top operational node (the Chief Operations Officer) ended up well placed to intercept novel information (see Granovetter’s Strength of Weak Ties). This alone was super interesting.
Next up was another interesting emergent property. The graph below shows the same graph as above, except color in this one stands for closeness centrality. Again, orange is low and blue/purple is high. Note the COO being dark blue again, but interestingly, note the most central node (upper middle) being a dark orange – that’s the CEO. So the CEO is technically not “close” to the rest of the network.
Another way of saying it is that the CEO falls pretty high on the farness index. Now, this is happening while the CEO is surrounded by a collection of high closeness nodes. We can intuit something important here: either through self-organization or top-down decision-making, the organization is designed to form a blockade of information to the CEO, while at the same time directing an avalanche of information to the CEO’s direct reports, and especially the COO. One can debate the pros and cons of that, but the centrality indexing we just performed is definitely illustrating some emergent properties more amenable to analysis through simulation.
Below we visualize the eigenvector centrality (higher score if my friends have more friends), which followed along the same insights we gained from the first two centrality indexes. Plus, EVC is more interesting when there are more cross links between nodes (triadic closure).
The next two graphs are where we size the nodes by betweenness and degree but use color as the community detection and segregation visual.
I used the Gephi standard built-in community detection algorithm since this project needed to be kept short and sweet, plus for this size and a low number of attributes, the majority of community detection algos will give me something similar anyway. Interestingly, the CD algorithm divided the network into what appears to be teams and departments that fall in line with what I knew about the departments anyway, while some were just inserted into one fairly large community.
My guess is that those inserted into one large community (grey color), even though they were not connected directly in the formal network, were not cohesive enough to be given their own community. Alone, this insight is not very powerful. After all, I can just change the resolution of the algorithm to something smaller and I would get a larger number of smaller communities that would be exactly representative of the departmental and team clusters, but that’s just changing my methods to suit the data and adds nothing new.
At some point, I should go back and look into the merger and acquisition timeline of this company because intuition tells us that there may be a difference in how clustered certain parts of the corporate network might be based on whether that part of the network grew organically or through a corporate acquisition.
Distributions
The last part of the visual analysis of the network continues the theme of looking at centrality indexes through the distributions themselves.
Looking at network distributions you tend to see that they pretty much all look roughly the same. For example degree distributions will tend to be skewed, fat and/or heavy tailed, and clustering coefficient distributions usually have a weird-looking squiggly line, reflecting that the clustering coefficient tends to have a preference for taking on values of certain fractions (for example reflective of dyads, triads, and cycles which tend to dominate network topologies).
But, a multi-dimensional visual analysis of the distributions could be insightful and interesting. So, with a little Tableau magic, the whole task becomes a cake-walk.
The first interesting part of the visual analysis would be to look at the distribution of community size as colored by the total amount of betweenness score enclosed within that community. This may be difficult to grasp as seldom does the traditional social network analysis approach recommend using this method of analysis from what I’ve seen, so you may not be used to understanding what it means.
Basically, once I ran the community detection algo, I plotted the size of each community identified and colored each bar in the histogram by the total amount of betweenness in that community. In other words, I colored the bars by (roughly) how many shortest paths pass through the nodes of that community. Think of it as community betweenness instead of node betweenness.
What you get is an identification of which communities will tend to be more central to the network as a whole since more short pathways flow through them. In the graph below, Red is low, Salmon is medium, and Blue is high. Interestingly, there were only 3 communities that had high pass-through potential, and while further investigation is needed, I would guess that those communities/departments are more operational in nature – like an IT department.
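The aggregation behind that chart is just a group-by. The sketch below assumes you already have a node-to-community mapping and node betweenness scores exported from your tool of choice; the values, the community labels, and the function name `community_profile` are made up for illustration, and the original chart was built in Tableau.

```python
from collections import defaultdict


def community_profile(community, betweenness):
    """Return {community: (size, summed node betweenness)}."""
    size = defaultdict(int)
    total = defaultdict(float)
    for node, label in community.items():
        size[label] += 1
        total[label] += betweenness.get(node, 0.0)
    return {label: (size[label], total[label]) for label in size}


# Hypothetical node-level exports.
community = {"a": 0, "b": 0, "d": 0, "e": 0, "c": 1, "f": 1}
node_betweenness = {"a": 6.0, "b": 7.0, "c": 4.0, "d": 0.0, "e": 0.0, "f": 0.0}

profile = community_profile(community, node_betweenness)
for label, (size, total) in sorted(profile.items()):
    print(label, size, total)  # bar height = size, bar color = total
```

Note that a shortest path crossing several nodes of the same community is counted once per node in this sum, so it is best read as a relative pass-through score rather than an exact path count.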
Next was a look at the Eigenvector Centrality index, colored by the Eccentricity dimension – that is – the distribution of the “friend of my friend” score as it is grouped (by color) from the center of a network.
Again, no surprise there for a hierarchical organization: Almost all low EVC nodes have a high eccentricity (they’re very far from the center of the network) while high power nodes are closer to the center. In fact, I’d imagine that almost every centrality measure we’ll look at is somehow governed by eccentricity.
Well, since eccentricity might seem to be a dominant social effect in this network, let’s look at it directly.
This is what the graph below shows by helping us visualize the total degree within each eccentricity level (there should be a total of 9 of course for our data set). This is a contained graph, meaning the total degree is summed within each eccentricity bar. As a beautiful demonstration of organizational behavior we find that the highest number of total connections (degree) is contained at levels 4 and 5 from the core.
OB people call that “middle management” and it is where most hires are usually made in most organizations. For an organization that tries to be lean because its business model is to provide governments with services, I’m not sure that’s super efficient, but since hierarchy embeds itself within a government contractor’s culture, it is not unexpected.
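The "contained" chart above boils down to summing degree within each eccentricity level. Here is a minimal sketch on a made-up tree, assuming undirected edges and a connected graph; the function name `total_degree_by_eccentricity` is hypothetical and the original chart was built in Tableau from Gephi output.

```python
from collections import defaultdict, deque


def total_degree_by_eccentricity(edges):
    """Return {eccentricity level: summed degree of the nodes at that level}."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    totals = defaultdict(int)
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:  # breadth-first search to find the farthest node
            node = queue.popleft()
            for neighbour in adj[node]:
                if neighbour not in dist:
                    dist[neighbour] = dist[node] + 1
                    queue.append(neighbour)
        level = max(dist.values())          # eccentricity of `source`
        totals[level] += len(adj[source])   # add its degree to that level's bar
    return dict(sorted(totals.items()))


# Hypothetical toy tree with "a" at the top.
toy_edges = [("b", "a"), ("c", "a"), ("d", "b"), ("e", "b"), ("f", "c")]
degree_by_level = total_degree_by_eccentricity(toy_edges)
print(degree_by_level)
```

In this toy tree the middle level holds the most connections, which is the same "middle management" bulge in miniature.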
Next is a less interesting visual mainly looking at the total number of connections a node possesses plotted against the contained betweenness. It makes sense that there are a large number of nodes with almost zero betweenness (mostly from the periphery), and then high-degree nodes with also low betweenness (remember the CEO!).
Closeness, since it is somewhat directly associated with Eccentricity in our data, seemed to also be somewhat of an important measure. Mainly because the observable social effects or the actor-oriented behaviors in our data set seem to be dominated by a node’s position in the structure more than anything else.
Note the marked part of the distribution – a group of 15 nodes with a very high closeness but a very low eccentricity. From one of the earlier network visualizations we saw that the CEO had a low closeness but was surrounded by subordinates that had high closeness, but viewing the distribution in this way shows us just how much those 15 nodes are close to the network as a whole. It’s also in line with an assumption of a core-periphery network.
The final distribution we’ll look at is a little quirky because in most contexts it’ll be nonsensical (is that a word?).
Here, we look at the betweenness centrality distribution as colored by closeness. That is, how many nodes have a particular betweenness score with a low or high closeness.
It was interesting that there seems to be some kind of inverse relationship. For example, the leftmost bar shows a group with very low or almost no betweenness but the majority of the (contained) closeness score of the whole network. This is suggesting that, in general, nodes are limited to either being brokers or non-brokers (by closeness). The meaning behind this is not entirely clear, but it could be that a social effect based on career choices that focus on either management or technical tracks is in play.
Conclusions
This project was a good example of a modern, exploratory and mainly visual analysis of a corporate network.
My reliance on visual analytics, distribution, and structural analysis is not something new, but it does leave behind a certain academic element which in this particular case was just not viable due to the low dimensionality of the data.
If I decided to run an ERGM, I would just get a collection of structural covariates in a model which confirms a lot of the things that we already know about corporate networks – a boring exercise in self-gratification. I wouldn’t even be able to include things like relating salary levels to structure for example because I just didn’t have that data – so really nothing would be very interesting.
So instead of going down the traditional path of analysis, I thought I’d look at the connections between the centrality measures themselves and I think it paid off. If nothing else it was a good refresher for over 9 years of social network analysis experience.
Another interesting thing that came out of this was that I was able to begin to intuit the node/actor level behavior involved in producing this specific network structure. Having looked at a lot of multi-dimensional visualizations of the network and its properties, finding out that eccentricity dominated will be paramount when I put together an agent-based simulation for the data.
As always, questions or comments are welcome below!