Making the Invisible Visible: SNA of the NSA
For weeks now we have been watching with awe and dismay as the National Security Agency (NSA) spying program story unfolds. For scientists and practitioners who work with social and organizational network analysis, it is especially important to be engaged in this topic and to realize that the methods we spend so much of our professional and academic lives studying, developing, promoting and practicing are being used as a weapon to spy on fellow American citizens.
What I’ve done in this post is try to explain how “meta data” analysis works for those who are not in the field. I used my own e-mail network as an example of how the NSA **might** be doing its analysis. There is enough hinting from the Guardian and other media outlets (as of the writing of this post) that meta data is actually nothing more than network data, with attributes (details of a person of interest) superimposed later. Since the science is difficult to understand at first look, using real data (my own) to help others understand how this works felt right. Below you will find a detailed explanation of how this method works, along with my own personal e-mail network analyzed and visualized for you (if you scroll down you will see a picture of the analysis). I removed the names and any identifying information to protect my friends and colleagues.
My objective is not to make a political statement, only to inform, and to explain that not all scientists and practitioners in the field engage in unethical or illegal activity. All our work is done with the consent of each individual participating, and this is the main point of difference. By “we” I mean academic researchers, consultants and educators in general.
At this point the assumption that the metadata collected by the NSA is in fact social network data is still premature, but there is a lot of evidence mounting from sources like the Guardian and several smaller American media outlets that what the NSA refers to as metadata is actually data related to nodal positions (persons) in a network structure. People in the social network analysis community have known for years that one way to catch a terrorist is to look at that terrorist’s immediate network structure (the people they communicate with), especially what we call the ego network (the first-degree connections that the terrorist has relationships with).
All of us in the community have always considered the ethical implications of our work, and on both the academic and business sides of the discipline we have set extremely stringent guidelines on the use of network data, and rightly so. It doesn’t take a network analysis expert to see how much information a network diagram presents to any given user. As someone who has used and practiced the science in countless operational and business situations, I can attest to its power to solve previously unsolvable issues in organizations and in other disciplines such as marketing and social entrepreneurship. This is why we have such a strong culture of ethical data collection and dissemination in the business of social network analysis: the methods are so powerful that they can easily be abused.
Having said that, I think it’s important to help those without experience in the discipline understand how this actually works, so here we go.
How “meta data” or Social Network Analysis (SNA) Works
The metadata that the NSA says it collects is composed of things like a name, telephone number, email address, or any other identifying piece of information that allows them to consolidate information into a single record. That record usually contains all information related to a single person. It might also include a list of connections that person has with other people. By definition that is the meta data: data that is part of the record but not based on the content itself (like what a person ACTUALLY says). Meta data is concerned with who talks to whom, not what they actually say. Think of it as the first step in a long process of analysis.
In the practice of SNA, we define connection based on the question that we are trying to answer. For example, if we’re trying to understand who is the most trustworthy person in a given network, we will define connection as “trust”, and for each person we will list everyone they trust and everyone who trusts them.
The NSA defines connection simply by who you talk to. They would be interested in who talks to whom, which by definition includes anyone you trust, respect, dislike, or anything else. It’s a much broader method of data collection, a leave-no-stone-unturned kind of analysis.
For example, a raw metadata file at the NSA might look something like this:
Period Collected: 07-15-2013 22:00 to 08-15-2013 22:00
As I mentioned, the metadata file will contain some identifying information, such as an email or IP address, that helps to track a person, and perhaps some information that is relevant for legal purposes, such as whether the person is an American citizen or not.
The important part of the meta-information is what we call edges (in the industry we also call them ties or links). An outgoing edge (on the map below, an edge is an arrow going from one person to another) could represent a relationship that I initiated with someone. So, for example, if you look at the first line, i.e. John Smith: since this is listed in the outgoing edge part of my metadata, it means that I initiated contact with John Smith. I imagine that the NSA has much more sophisticated data that also tells you when the contact was initiated and by what medium (Twitter, Facebook, email, phone, in person, or any other medium), and that contains information such as how long the contact lasted, as well as any flags raised by some kind of keyword analysis algorithm.
Now at this point the metadata still does not contain any information about what the contact was about! This is key to understand about meta data: it doesn’t contain content, only the envelope information. So from the metadata all you would find out is who is talking to whom.
The same thing applies to incoming edges. If you go to the second line of the incoming edge bullet, i.e. Jane Doe, that means that Jane Doe initiated contact with me (like an incoming call or e-mail).
Separating out who I contact versus who contacts me may not seem very important to the casual observer, but it is what network studies are actually made of. We call the type of network study that contains incoming and outgoing edges a directed (or directional) network graph.
Once we have all that information, we compute what we call network metrics: metrics that allow us to understand the importance of each person relative to other people in the network. The simplest way of understanding network metrics is that if you have a lot of incoming connections, that (probably) means you’re relatively more important in the network, and if you have a lot of outgoing connections, it means that you are actively reaching out to the rest of the network. So what were simple arrows a second ago now start to illustrate a person of interest’s behavior!
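To make the counting concrete, here is a minimal sketch in Python of how incoming and outgoing connections could be tallied from a list of who contacted whom. The names and the `edges` list are invented for illustration, and this is only my guess at the simplest possible version, not a description of any agency's actual tooling.

```python
from collections import Counter

# Hypothetical metadata: (sender, recipient) pairs only, no message content.
edges = [
    ("me", "john"),
    ("jane", "me"),
    ("john", "me"),
    ("ali", "me"),
    ("me", "jane"),
    ("ali", "john"),
]

def degree_counts(edges):
    """Return (in_degree, out_degree) counters for a directed edge list."""
    in_degree, out_degree = Counter(), Counter()
    for sender, recipient in edges:
        out_degree[sender] += 1    # the sender initiated a contact
        in_degree[recipient] += 1  # the recipient received a contact
    return in_degree, out_degree

in_degree, out_degree = degree_counts(edges)

# People ordered by how many incoming contacts they have.
ranking = [person for person, _ in in_degree.most_common()]
```

In this toy data "me" receives three contacts and ends up at the top of `ranking`, while "ali" only ever reaches out and has no incoming contacts at all.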
We try to look at patterns both qualitatively and quantitatively in the network to help identify which person we should focus our efforts on for a deeper analysis, which may include looking at the content of that person’s communications, including email, phone records, and social media statuses. This is the most fundamental type of analysis and is called nodal analysis. [Note: what we do in management consulting is slightly different, and remember we do everything with the consent of ALL participants.]
There are other types of analysis that we use to understand networks. For example, dyadic analysis tries to understand the relationship itself without focusing on the person/node in the network, or we can look at metrics for the network as a whole, such as density, clustering, modularity, and diameter.
In most instances we use visualization software that helps us understand large data sets, because sometimes the naked eye can see patterns that some of the best quantitative modeling available cannot. That said, when the network gets too big, the visualization rarely yields any concrete themes.
To show you a real example, I’ve put up below the network graph of my personal email network, without the names or identities of the people I have communicated with. Take a look at the graph, and then we will discuss some of it to help you understand how the NSA [probably] does this.
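As a rough illustration of a network-level metric and a dyadic one, the sketch below computes density (the share of possible directed ties that actually exist) and lists the mutual pairs, where each person has contacted the other. The edge list is made up, and the function names are my own.

```python
# Hypothetical directed edge list: (sender, recipient).
edges = [
    ("me", "john"),
    ("jane", "me"),
    ("john", "me"),
    ("ali", "me"),
    ("me", "jane"),
    ("ali", "john"),
]

def density(edges):
    """Share of possible directed ties that are present (self-contacts ignored)."""
    ties = {(a, b) for a, b in edges if a != b}
    people = {person for tie in ties for person in tie}
    n = len(people)
    if n < 2:
        return 0.0
    return len(ties) / (n * (n - 1))

def mutual_dyads(edges):
    """Pairs of people who have each contacted the other at least once."""
    ties = {(a, b) for a, b in edges if a != b}
    return {frozenset((a, b)) for a, b in ties if (b, a) in ties}
```

With four people there are twelve possible directed ties; six are present here, so the density is 0.5, and only the me/john and me/jane relationships run in both directions.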
Let’s start with the basics. Each circle in the network graph represents an email account, i.e. a person. Each arrow represents an email going from one person to another: an arrow from one node (circle/person) to another means the first person sent an e-mail to the second. The NSA may choose a different meaning for each arrow, such as a phone call, a tweet, or a “like” on Facebook. The size of each circle in this particular graph represents what we call in-degree, a number that counts how many incoming links a person has. The larger the circle, the more incoming e-mails that person gets from the people around them. Graphed this way, we can very quickly and easily tell which nodes are the most important in terms of popularity. The idea is that if you’re getting a lot of emails or calls from the network, then you must be an important person to that network. The business applications of the model allow us, for example, to identify people in companies who are highly sought after by their coworkers for advice or information relevant to getting the work done. The NSA would use it to identify leaders or members of terrorist organizations.
Again in this particular graph, I assigned the color of each group of nodes based on an algorithm that determines cliques and clusters. The idea there is that I’m able to determine, mathematically, where in-groups and small communities have developed. If you wanted to go after terrorists, this might help in identifying groups that specialize in a specific region of the world, or for example groups that have a particular functional responsibility, such as planning an attack. This is based on the premise that more communication within a particular group means that they are closer together, and likely work together more than the rest of the network does. Think of the top 5 people you talk to on a daily basis–that’s probably your clique/in-group.
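The clustering algorithm used for the graph is more involved than anything that fits here, but the “top 5 people you talk to” intuition can be sketched directly. The snippet below is only that crude stand-in, not a community-detection method: for each person it ranks their contacts by message volume, ignoring direction. The `messages` data and the function name are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical metadata: one (sender, recipient) pair per message.
messages = [
    ("me", "ann"),
    ("ann", "me"),
    ("me", "ann"),
    ("bob", "me"),
    ("me", "bob"),
    ("cy", "me"),
]

def top_contacts(messages, k=5):
    """For each person, the k people they exchange the most messages with."""
    volume = defaultdict(Counter)
    for sender, recipient in messages:
        if sender == recipient:
            continue  # ignore notes-to-self
        volume[sender][recipient] += 1
        volume[recipient][sender] += 1
    result = {}
    for person, counts in volume.items():
        # Highest volume first; ties broken alphabetically for a stable order.
        ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
        result[person] = [other for other, _ in ordered[:k]]
    return result

closest = top_contacts(messages, k=2)
```

Here `closest["me"]` comes out as ann followed by bob. A real clustering method looks at the whole network at once rather than one person at a time, which is why it can separate work, social and academic groups.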
The algorithm in the case of my personal email network is almost spot on! The different clusters and groups represent different parts of my life such as social life, work life and my academic network.
NOTE: There’s something that you have to understand about the implications of the network graph you are looking at. Unless I’m able to collect the metadata of **ALL** the people I have talked to, or who have talked to me, over a given period of time, none of the patterns that you see become clear. Remove any central node and the pattern changes drastically. That’s why collecting all the data is very important (and the main reason no US person is free from infringement on their privacy).
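A small sketch of why one missing node matters: the function below finds the connected groups in a toy network (ignoring the direction of contact) and can drop selected people first. The edge list and names are invented, and real analyses would use far richer measures than simple connectivity.

```python
# Hypothetical contact pairs; "me" is the central node.
edges = [
    ("me", "a"),
    ("me", "b"),
    ("me", "c"),
    ("a", "b"),
    ("c", "d"),
]

def components(edges, removed=()):
    """Connected groups of people after dropping everyone in `removed`."""
    removed = set(removed)
    people = {p for edge in edges for p in edge} - removed
    neighbours = {p: set() for p in people}
    for a, b in edges:
        if a in people and b in people:
            neighbours[a].add(b)
            neighbours[b].add(a)
    groups, seen = [], set()
    for start in sorted(people):
        if start in seen:
            continue
        group, stack = set(), [start]
        while stack:  # depth-first walk from `start`
            person = stack.pop()
            if person not in group:
                group.add(person)
                stack.extend(neighbours[person] - group)
        seen |= group
        groups.append(group)
    return groups

whole = components(edges)
without_me = components(edges, removed={"me"})
```

With every record present the five people form a single connected group. Take out the central node and the same data splits into two unrelated pairs, which is exactly the kind of pattern that disappears when a record is missing.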
Now the NSA claims that it is only collecting data from second-degree contacts. Well, what you see in front of you is only data from my first-degree contacts. That is roughly 600 contacts with about 3000 emails, in the space of only six months, from only one of my several email accounts. If the NSA collects data for only those six months out to the second degree, under the assumption that my contacts have roughly the same number of contacts as I do, it would have to track the metadata of about 700×700 contacts, or about 490,000 people, many of whom will have nothing to do with who I am and what I do. And that is only six months of data. Well, you guessed it: the NSA DOES look at second-degree and, in certain cases, third-degree contacts.
No wonder the Guardian reports that the NSA’s XKeyscore program has collected 850 billion “call events” in the last five years alone. If it is determined that the NSA uses network analysis for its metadata collection, I guarantee you that at least half of the United States citizenry has had a record pulled up by the NSA at some point. This is a requirement of using SNA to understand networks and to find terrorists: complete data collection.
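The back-of-the-envelope arithmetic can be written out as a tiny function. It assumes every person has the same number of contacts and that no two contact lists overlap, so it is an upper bound rather than a headcount; the function name is my own.

```python
def reach_upper_bound(contacts_per_person, degrees):
    """Upper bound on people reached at the given degree of separation.

    Assumes everyone has the same number of contacts and that contact
    lists never overlap, which real networks do not satisfy.
    """
    return contacts_per_person ** degrees

second_degree_700 = reach_upper_bound(700, 2)  # 490,000
second_degree_600 = reach_upper_bound(600, 2)  # 360,000
third_degree_600 = reach_upper_bound(600, 3)   # 216,000,000
```

With the 700 used in the estimate above the bound is 490,000 people; with the roughly 600 contacts actually observed it is 360,000. Either way, going one more degree out multiplies the figure by several hundred again.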
Anyway, let’s get back to the explanation…
If you go back to the map that I posted, every single person who is represented by a larger circle than the rest becomes a suspect and may have the content of their emails and calls pulled up, looked at, and investigated, strictly because they were obliged to communicate with someone whom a quantitative model determined to be a person of interest.
And then we have to look at other positions in the network for other kinds of patterns. For example, network brokers tend to be people who connect two or more clusters/groups together, periphery specialists tend to be on the fringe of a network, and power brokers are those who essentially control information flow inside a single cluster. Identifying and investigating each type helps us to identify different roles in any network.
What I’m hoping you start seeing is that these so-called meta-data analysis methodologies leave no one untouched. People should understand this first, and then make a decision on whether or not they are willing to put up with it for the sake of their safety and security.
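As a hedged sketch of the broker idea: if we already know which group each person belongs to, a broker can be approximated as anyone whose ties touch two or more groups. The group labels, the edge list and the `find_brokers` name are all hypothetical, and practitioners use more refined measures than this simple count.

```python
from collections import defaultdict

# Hypothetical group labels, e.g. produced by a clustering step.
group_of = {"a": "work", "b": "work", "c": "work", "x": "social", "y": "social"}

# Hypothetical contact pairs; only the c/x tie crosses groups.
edges = [("a", "b"), ("b", "c"), ("x", "y"), ("c", "x")]

def find_brokers(edges, group_of, min_groups=2):
    """People whose ties span at least `min_groups` distinct groups.

    Every person appearing in `edges` must have an entry in `group_of`.
    """
    groups_touched = defaultdict(set)
    for a, b in edges:
        both = {group_of[a], group_of[b]}
        groups_touched[a] |= both
        groups_touched[b] |= both
    return {p for p, groups in groups_touched.items() if len(groups) >= min_groups}

brokers = find_brokers(edges, group_of)
```

In this toy network only "c" and "x" have ties that reach outside their own group, so they are the ones holding the work and social clusters together.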