Data Mining, The Internet, and Counter Intelligence

We can, and do, talk about data privacy and ownership until we're blue in the face. We also talk about how seriously screwed up some of the things the National Security Agency did were. But we never really harp on the obvious fact: all the security in the world doesn't do a damn bit of good if you give your information away. One ambitious JSON project over on GitHub, called Looking Glass, is capitalizing on that very fact.

In fact, as of this publication, they have data mined 139,361 resumes belonging to military and civilian officials within our nation's Intelligence, Surveillance, and Reconnaissance (ISR) fields.  Job seekers have used those handy little endorsements to categorize themselves into fields (e.g. "Security Clearance" or "ISR"), and data miners have been more than willing to scoop that information up.

Why It's Not About Privacy

I've faced some opposition recently based on my views that the Electronic Frontier Foundation did a disservice to their constituents by focusing so much of their efforts on privacy, rather than data ownership.  With that in mind, I pose two ethical scenarios to help illustrate my (and the Guardian's) point that solving the data ownership debate will solve far more than just the privacy debate.

Our laws are focused on data collection, but the existence of data is not the concern; it’s the usage and sharing of data.  In today’s interconnected world, individuals are no longer as concerned about what a given company knows about them, but how it’s used and with whom that information is shared.  These are issues that cannot be solved when we limit the scope of our conversation to privacy, but must be evaluated in the larger discussion of establishing ethical data ownership legislation.

NSA and Graph Theory

Foreign governments are in a state of panic and are looking to "balkanize" the internet, domestic judges are ruling the NSA's methods unconstitutional, and lawmakers are looking to turn off the utilities at NSA facilities.  However, many people (myself included) take some solace in the idea that we may not be under as much scrutiny as we might think.  We like to assume that if we can't make sense of that much information, then no one can; and the more we know about analyzing data, the more readily we fall into that logical fallacy.

I was among them.  Having taken graduate courses in data analytics, I operated under and espoused the belief that the NSA can't possibly analyze all of the information it collects through PRISM, CO-TRAVELER, and Landscaping; and while I wasn't wrong, it turns out they don't actually have to.  After a conversation with +Andreas Schou, I was introduced to graph theory: the methodology that scientists have been using for years to make sense of large amounts of relational data.

Graph theory is the study of graphs: mathematical structures used to model relationships between objects.  These objects are connected by "edges," which link them based on observed or mathematically inferred relationships.  Graph theory is a branch of discrete mathematics, but it is also used in computer science to represent networks of devices, data, or information flow; in sociology to measure an actor's prestige (think Six Degrees of Kevin Bacon) and to conduct social network analysis; and in law enforcement to analyze associations within criminal organizations.
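To make the idea concrete, here is a minimal sketch of modeling objects and relationships as a graph, using a plain adjacency map.  The identifiers are entirely made up for illustration:

```python
# A graph as an adjacency map: each object (vertex) maps to the set of
# objects it shares an observed relationship (an "edge") with.
graph = {
    "person:J.Doe":   {"phone:999-4442", "card:4111-22"},
    "phone:999-4442": {"person:J.Doe", "phone:555-0199"},
    "card:4111-22":   {"person:J.Doe"},
    "phone:555-0199": {"phone:999-4442"},
}

# Edges here are symmetric: the person links to the phone and the phone
# links back, so a relationship can be queried from either end.
print("phone:999-4442" in graph["person:J.Doe"])  # True
print("person:J.Doe" in graph["phone:999-4442"])  # True
```

Real systems use far richer structures (typed edges, weights, timestamps), but the core idea is just this: vertices for objects, edges for relationships.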

This associative analysis can help intelligence analysts determine the relationship between different objects (a credit card can be linked to a cell phone, which can be linked to a person who has a criminal record).  The problem with PRISM (et al) is that the intelligence net that is cast is so large that information overload is a serious problem.  How does the NSA, or any large data company (Google, Amazon, Facebook), handle these large data sets?  As we know, any savvy criminal will have more than one phone and almost everyone has more than one e-mail address, credit card, or digital avatar.  The sheer number of objects contained within a graph that attempts to map every transaction, phone call, and relationship will quickly become unmanageable.
Within graph theory, algorithms are relied on to split complex data into smaller, more manageable graphs.  I'm not a data scientist, so I'm fuzzy on the specifics, but large, complex graphs can be partitioned into smaller subgraphs through algorithmic computation.  These smaller graphs isolate a set of objects known to be of interest to law enforcement agencies, and that data can then be analyzed.  For example, if the Los Angeles Police Department picks up a known fugitive and determines that his phone number is 999-4442, then using graph theory the NSA could extract a subgraph of the information relationships deemed most relevant to 999-4442, such as that fugitive's credit card, his burner phone, and his favorite pizza parlor's phone number.  Ideally, contained within this subgraph will be a link to another, unknown criminal who may be participating in illegal activities.

Mr. Schou posits that through a combination of PRISM, CO-TRAVELER, and Landscaping information, the NSA can create a relational graph of virtually everyone in the world.  The NSA's three-degrees methodology for determining from whom they collect information is enough to guarantee that almost everyone is going to end up on the NSA's radar.  Mr. Schou goes on to make the distinction that since the NSA's new metadata surveillance methods collect data not against persons, but against phone numbers, credit card numbers, and virtual avatars, the surveillance net quickly reaches an exponential growth rate.  For example, take a look at how this "three degree" methodology is explained with this slideshow:



Going back to our example involving the Los Angeles Police Department, the fugitive with the phone number "999-4442" will enable the surveillance of 125,000 individuals (assuming an average of 50 contacts per identifier, three hops yields 50 × 50 × 50 = 125,000); and this only accounts for the data collection associated with PRISM.  When you add in the CO-TRAVELER and Landscaping methodologies, that Los Angeles fugitive is going to cast a pretty wide net.  And when you consider that some phone numbers are going to be related to significantly more than 50 individuals (have you ever called Microsoft's Tech Support?), the exponential growth from this "three hops" rule becomes staggering.
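As a back-of-the-envelope check on that 125,000 figure, here is the fan-out arithmetic.  The 50-contacts-per-identifier average is an assumption, and the model deliberately ignores overlap between people's contact circles, so it overstates unique individuals while understating the effect of high-degree hubs:

```python
def hop_reach(avg_contacts, hops=3):
    # Idealized fan-out: each hop multiplies coverage by the average number
    # of distinct contacts per identifier, ignoring overlapping circles.
    return avg_contacts ** hops

print(hop_reach(50))  # 125000 -- the three-hop figure for one seed number

# A single high-degree node wrecks the average: if just one first-hop
# contact is a support line dialed by 10,000 people, that one branch
# alone contributes 1 * 10_000 * 50 = 500,000 identifiers by hop three.
print(1 * 10_000 * 50)  # 500000
```

Even this toy model makes the point: the hop rule's reach is driven by the best-connected node in the chain, not the typical one.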

So who does this "Three Hops Rule" actually protect?

Privacy, PRISM, and the flawed 1984 Argument

"Privacy, in other words, involves so many things that it is impossible to reduce them all to one simple idea.  And we need not do so."

Privacy means different things to different people, and it's almost insultingly oppressive to attempt to superimpose your definition of privacy onto others.  For example, my personal belief is that anything conducted in or seen from the public domain should hold no expectation of privacy.  However, others might disagree with that assertion and deem me to be a government stooge.

Who am I to superimpose my privacy expectations onto them?  Unfortunately, in a democratic republic, that's what we do every time we vote.  That's what our leaders do every time they make decisions, and that's what PRISM and CISPA are all about.