NSA and Graph Theory

Foreign governments are in a state of panic and are looking to "balkanize" the internet, domestic judges are ruling the methods unconstitutional, and lawmakers are looking to turn off the utilities at NSA plants.  However, many people (myself included) take some solace in the idea that we may not be under as much scrutiny as we might think.  We like to assume that if we can't make sense of that much information, then no one can; and the more we know about analyzing data, the more often we fall into that logical fallacy.

I was among those.  Having taken graduate courses in data analytics, I operated under and espoused the belief that the NSA can't possibly analyze all of the information that they're collecting through PRISM, CO-TRAVELER, and Landscaping; and while I was not wrong, it turns out that they don't actually have to.  After a conversation with +Andreas Schou, I was introduced to Graph Theory, the methodology that scientists have been using for years to make sense of large amounts of relational data.

Graph Theory is the study of graphs, which are mathematical structures used to model the relationships between objects.  These objects are connected by "edges" which map the objects based on observed or mathematically inferred relationships.  Graph theory is primarily used in the study of discrete mathematics, but it can also be used in computer science to represent networks of devices, data, or information flow; in sociology to measure an actor's prestige (Six Degrees of Kevin Bacon); in social network analysis; and in analyzing associations within criminal organizations.

This associative analysis can help intelligence analysts determine the relationship between different objects (a credit card can be linked to a cell phone, which can be linked to a person who has a criminal record).  The problem with PRISM (et al.) is that the intelligence net being cast is so large that information overload is a serious problem.  How does the NSA, or any large data company (Google, Amazon, Facebook), handle these large data sets?  As we know, any savvy criminal will have more than one phone, and almost everyone has more than one e-mail address, credit card, or digital avatar.  The sheer number of objects contained within a graph that attempts to map every transaction, phone call, and relationship will quickly become unmanageable.

Graph theory relies on algorithms to split this complex data into smaller, more manageable graphs.  I'm not a data scientist, so I'm very fuzzy on the specifics of this, but the idea is that each smaller graph isolates a set of objects that are known to be of interest to law enforcement agencies, and that data can then be analyzed.  For example, if the Los Angeles Police Department picks up a known fugitive and determines that his phone number is 999-4442, then using graph theory the NSA could extract a subgraph of the relationships deemed most relevant to 999-4442, such as that fugitive's credit card, his burner phone, and his favorite pizza parlor's phone number.  Ideally, this subgraph will contain a link to another, unknown, criminal who may be participating in illegal activities.
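To make the subgraph idea a little more concrete, here is a minimal sketch in Python.  It is not how the NSA (or anyone else) actually does this; it simply uses a breadth-first search to pull out everything within a fixed number of hops of a starting identifier.  The node labels, the toy edge list, and the function names `build_adjacency` and `k_hop_subgraph` are all made up for illustration.

```python
from collections import deque


def build_adjacency(edges):
    """Turn a list of (a, b) pairs into an undirected adjacency map."""
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    return adjacency


def k_hop_subgraph(adjacency, seed, max_hops):
    """Return {node: hop distance} for every node within max_hops of seed."""
    distance = {seed: 0}
    queue = deque([seed])
    while queue:
        node = queue.popleft()
        if distance[node] == max_hops:
            continue  # do not expand past the hop limit
        for neighbor in adjacency.get(node, ()):
            if neighbor not in distance:
                distance[neighbor] = distance[node] + 1
                queue.append(neighbor)
    return distance


# Hypothetical toy data: identifiers, not people, are the nodes.
edges = [
    ("phone:999-4442", "card:fugitive"),
    ("phone:999-4442", "phone:burner"),
    ("phone:999-4442", "phone:pizza-parlor"),
    ("phone:burner", "phone:unknown-associate"),
    ("phone:pizza-parlor", "phone:other-customer"),
    ("phone:other-customer", "phone:stranger"),
    ("phone:stranger", "phone:far-away"),
]

adjacency = build_adjacency(edges)
subgraph = k_hop_subgraph(adjacency, "phone:999-4442", max_hops=2)

for node, hops in sorted(subgraph.items(), key=lambda item: item[1]):
    print(hops, node)
```

In this toy graph the burner phone's contact turns up two hops out, which is the kind of link an analyst would be hoping for, but so does another customer of the pizza parlor, who has nothing to do with the fugitive.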

Mr. Schou posits that through a combination of PRISM, CO-TRAVELER, and Landscaping information, the NSA can create a relational graph of virtually everyone in the world.  The NSA's "three degrees" methodology for determining from whom they collect information is enough to guarantee that almost everyone is going to end up on the NSA's radar.  Mr. Schou goes on to make the distinction that because the NSA's new metadata surveillance methods collect data not against persons but against phone numbers, credit card numbers, and virtual avatars, the surveillance net quickly grows at an exponential rate.  For example, take a look at how this "three degree" methodology is explained with this slideshow:



Going back to our example involving the Los Angeles Police Department, the fugitive with the phone number "999-4442" will enable the surveillance of 125,000 individuals (assuming each number is linked to about 50 others, that is 50 x 50 x 50 by the third hop); and this just looks at the data collection associated with PRISM.  When you add in the CO-TRAVELER and Landscaping methodologies, that Los Angeles fugitive is going to cast a pretty wide net.  When you consider that some phone numbers are going to be related to significantly more than 50 individuals (have you ever called Microsoft's Tech Support?), the exponential increase from this "three hops" rule becomes effectively unbounded.
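The arithmetic behind that number is easy to check.  The sketch below assumes every identifier has the same number of contacts and that none of those contacts overlap, which is an upper-bound simplification rather than a model of real call records; the function name `reach_by_hop` is just for illustration.

```python
def reach_by_hop(contacts_per_node, hops):
    """New identifiers reached at each hop, assuming no overlap between contacts."""
    return [contacts_per_node ** hop for hop in range(1, hops + 1)]


typical = reach_by_hop(50, 3)
print(typical)        # per-hop counts for 50 contacts each
print(sum(typical))   # everyone swept in across all three hops

# A single well-connected number (a support line, say) changes the picture.
busy = reach_by_hop(1000, 3)
print(busy[-1])       # third-hop count for 1,000 contacts each
```

With 50 contacts per number the third hop alone reaches 125,000 identifiers; with 1,000 contacts per number it reaches a billion.  The growth is not literally infinite, but it runs out of people on the planet long before it runs out of hops.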

So who does this "Three Hops Rule" actually protect?