Big Data and Privacy

Earlier this week, the President's Council of Advisors on Science and Technology (PCAST) released a seventy-two-page report on the intersection of big data and privacy, with the unoriginal title of Big Data and Privacy: A Technological Perspective.  It starts by establishing the groundwork for the traditional definition of privacy, as defined by Samuel Warren and Louis Brandeis in 1890.  Under that definition, privacy infractions can occur in one of four ways:

  1. Intrusion upon seclusion.  If a person intentionally intrudes upon the solitude of another person (or their affairs), and the intrusion is seen as "highly offensive," then an invasion of privacy has occurred.
  2. Public disclosure of private facts.  If a person publishes private facts about someone's life, even true ones, an invasion of privacy has occurred.
  3. Defamation.  The publication of untrue facts about a person is an invasion of privacy.
  4. Appropriation.  Removing an individual's control over their own name and/or likeness for commercial gain is an invasion of privacy.

These infractions basically come down to removing the control that an individual has over various aspects of their life: being left alone, selective disclosure, and reputation.  PCAST tends to agree, noting several times throughout the report the need for selective sharing and anonymity.  The report goes on to address a few philosophical changes in our mindset about privacy that are needed to enable the successful implementation of its five recommendations:


  • We must first acknowledge that interception of private communication is easier than ever.
  • We need to extend "the home as one's castle" to become "the castle in the clouds."
  • Inferred private facts are just as stolen as real data.
  • The misuse of data, and the loss of selective anonymity, is the key issue.


The report goes on to state that the majority of the concern is with the harm done by the use of personal data, and that the historic way of preventing misuse has been to control access to that data; a measure that is no longer possible in today's nebulous world of data ownership.

Personal data may never be, or have been, within one's possession.

From public cameras and sensors to other people using social media, we simply have no control over who collects data about us, and we likely never will again.  This raises the question of who owns the data and who controls it.

And while the Electronic Frontier Foundation would complain (again) that the report fails to address metadata (despite the report equating metadata to actual data in its first few pages), it comes on the eve of a unanimous vote in the House to rein in the National Security Agency, making this a big week for big data privacy advocates.

NSA and Graph Theory

Foreign governments are in a state of panic and are looking to "balkanize" the internet, domestic judges are ruling the NSA's methods unconstitutional, and lawmakers are looking to turn off the utilities at NSA facilities.  However, many people (myself included) take some solace in the idea that we may not be under as much scrutiny as we might think.  We like to assume that if we can't make sense of that much information, then no one can; and the more we know about analyzing data, the more readily we jump to that logical fallacy.

I was among them.  Having taken graduate courses in data analytics, I had adopted and espoused the belief that the NSA can't possibly analyze all of the information it collects through PRISM, CO-TRAVELER, and Landscaping; and while I was not wrong, it turns out that they don't actually have to.  After a conversation with +Andreas Schou, I was introduced to graph theory: the methodology that scientists have been using for years to make sense of large amounts of relational data.

Graph theory is the study of graphs: mathematical structures used to model the relationships between objects.  These objects are connected by "edges," which link them based on observed or mathematically inferred relationships.  Graph theory is primarily studied within discrete mathematics, but it is also used in computer science to represent networks of devices, data, or information flow; in sociology to measure an actor's prestige (Six Degrees of Kevin Bacon); in social network analysis; and in analyzing associations within criminal organizations.
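As a minimal sketch of the idea (all names and numbers here are invented for illustration), such a graph can be stored as an adjacency list that maps each object to the objects it shares an edge with:

```python
# A tiny relationship graph stored as an adjacency list.
# Nodes are objects (people, phones, cards); edges are observed links.
# Every name and number below is fictional.
graph = {
    "person:J. Doe":  ["phone:999-4442", "card:1111"],
    "phone:999-4442": ["person:J. Doe", "phone:888-1234"],
    "card:1111":      ["person:J. Doe"],
    "phone:888-1234": ["phone:999-4442"],
}

def connected(a, b):
    """True if an edge directly links objects a and b."""
    return b in graph.get(a, []) or a in graph.get(b, [])

print(connected("person:J. Doe", "phone:999-4442"))  # True: a direct edge
print(connected("person:J. Doe", "phone:888-1234"))  # False: two hops apart
```

Note that "J. Doe" and "888-1234" are not directly linked, yet the graph still relates them through the 999-4442 number; that indirect reachability is what the analysis below exploits.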

This associative analysis can help intelligence analysts determine the relationships between different objects (a credit card can be linked to a cell phone, which can be linked to a person with a criminal record).  The problem with PRISM (et al.) is that the intelligence net being cast is so large that information overload becomes a serious problem.  How does the NSA, or any large data company (Google, Amazon, Facebook), handle data sets this large?  Any savvy criminal will have more than one phone, and almost everyone has more than one e-mail address, credit card, or digital avatar.  The sheer number of objects in a graph that attempts to map every transaction, phone call, and relationship quickly becomes unmanageable.
Within graph theory, algorithms are used to split complex data into smaller, more manageable graphs.  I'm not a data scientist, so I'm fuzzy on the specifics, but large, complex graphs can be partitioned into smaller ones through algorithmic computation.  These smaller graphs isolate a set of objects known to be of interest to law enforcement agencies, and that data can then be analyzed.  For example, if the Los Angeles Police Department picks up a known fugitive and determines that his phone number is 999-4442, then using graph theory the NSA could extract a subgraph of the information relationships deemed most relevant to 999-4442: the fugitive's credit card, his burner phone, his favorite pizza parlor's phone number.  Ideally, this subgraph will contain a link to another, previously unknown criminal who may be participating in illegal activities.
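One simple version of that subgraph extraction is a breadth-first search that keeps only the nodes within a fixed number of hops of a seed object.  This sketch uses invented data; the seed "phone:999-4442" and the other labels are hypothetical, and real partitioning algorithms are far more sophisticated:

```python
from collections import deque

# Invented adjacency list: objects linked by observed relationships.
graph = {
    "phone:999-4442":  {"person:fugitive", "card:1111", "phone:pizza-parlor"},
    "person:fugitive": {"phone:999-4442", "phone:burner"},
    "card:1111":       {"phone:999-4442"},
    "phone:pizza-parlor": {"phone:999-4442", "person:unknown-associate"},
    "phone:burner":    {"person:fugitive"},
    "person:unknown-associate": {"phone:pizza-parlor"},
    "person:unrelated": set(),
}

def subgraph_within(graph, seed, hops):
    """Return every node reachable from `seed` in at most `hops` edges (BFS)."""
    seen = {seed}
    frontier = deque([(seed, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth == hops:
            continue  # don't expand past the hop limit
        for neighbor in graph.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                frontier.append((neighbor, depth + 1))
    return seen

# Two hops from the fugitive's number already sweep in the unknown
# associate via the pizza parlor, while "person:unrelated" stays outside.
print(subgraph_within(graph, "phone:999-4442", 2))
```

The hop limit is doing the real work here: raising it by one multiplies the number of nodes swept in, which is exactly the fan-out problem discussed below.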

Mr. Schou posits that through a combination of PRISM, CO-TRAVELER, and Landscaping information, the NSA can create a relational graph of virtually everyone in the world.  The NSA's "three degrees" methodology for determining from whom it collects information is enough to guarantee that almost everyone is going to end up on the NSA's radar.  Mr. Schou goes on to make the distinction that since the NSA's new metadata surveillance methods collect not against persons but against phone numbers, credit card numbers, and virtual avatars, the surveillance net quickly reaches an exponential growth rate.  For example, take a look at how this "three degree" methodology is explained with this slideshow:



Going back to our example involving the Los Angeles Police Department: if each selector is linked to roughly 50 others, the fugitive with the phone number "999-4442" will enable the surveillance of 125,000 individuals, and this looks only at the data collection associated with PRISM.  When you add in the CO-TRAVELER and Landscaping methodologies, that Los Angeles fugitive is going to cast a very wide net.  And when you consider that some phone numbers are going to be related to significantly more than 50 individuals (have you ever called Microsoft's Tech Support?), the exponential growth from this "three hops" rule becomes staggering.
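The arithmetic behind that 125,000 figure is worth making explicit.  Assuming (as the example above does) a uniform 50 contacts per selector, the third hop alone reaches 50³ selectors, and counting all three hops adds the first two rings as well:

```python
# Fan-out of a "three hops" rule, assuming each selector (phone number,
# e-mail, card) is linked to about 50 others. The 50-per-hop figure is
# the assumption used in the example above, not an official NSA number.
contacts_per_hop = 50

third_hop = contacts_per_hop ** 3  # selectors reached at exactly hop 3
total = sum(contacts_per_hop ** k for k in range(1, 4))  # hops 1-3 combined

print(third_hop)  # 125000, the figure quoted above
print(total)      # 127550 once the first two hops are counted too
```

Any selector with a larger contact list (a business line, a support number) multiplies every ring beyond it, which is why the growth outpaces intuition so quickly.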

So who does this "Three Hops Rule" actually protect?