Data Science

Data Mining, The Internet, and Counter Intelligence

We can, and do, talk about data privacy and ownership until we're blue in the face. We also talk about how seriously screwed up some of the things the National Security Agency did were. But we never really harp on the obvious fact: all the security in the world doesn't do a damn bit of good if you give your information away. One ambitious JSON project over on GitHub, called Looking Glass, is capitalizing on that very fact.

In fact, as of this publication, they have data mined over 139,361 resumes belonging to military and civilian officials within our nation's Intelligence, Surveillance, and Reconnaissance (ISR) fields.  Job seekers have used those handy little endorsements to categorize themselves into fields (e.g. "Security Clearance" or "ISR"), and data miners have been more than willing to scoop that information up.

Big Data and Privacy

Earlier this week, the President's Council of Advisors on Science and Technology (PCAST) released a seventy-two-page report on the intersection of Big Data and privacy with the unoriginal title Big Data and Privacy: A Technological Perspective.  It starts by laying the groundwork with the traditional definition of privacy, as defined by Samuel Warren and Louis Brandeis in 1890.  Under that definition, privacy infractions can occur in one of four ways:

  1. Intrusion upon seclusion.  If a person intentionally intrudes upon the solitude of another person (or their affairs), and the intrusion is seen as "highly offensive" then an invasion of privacy has occurred.
  2. Public disclosure of private facts.  If a person publishes private facts about someone's life, even if they are true, an invasion of privacy has occurred.
  3. Defamation, or the publication of untrue facts, is an invasion of privacy.
  4. Removing personal control of an individual's name and/or likeness for commercial gain is an invasion of privacy.

These infractions basically come down to removing the control that an individual has over various aspects of their life (being left alone, selective disclosure, and reputation), and PCAST tends to agree: the report returns several times to the need for selective sharing and anonymity.  It goes on to address a few philosophical changes in our mindset about privacy that are needed to successfully implement its five recommendations:


  • We must first acknowledge that intercepting private communication has become easier.
  • We need to extend "home as one's castle" to become "the castle in the clouds".
  • Inferred private facts are just as stolen as real data.
  • The misuse of data and the loss of selective anonymity are the key issues.


The report goes on to state that most of the concern is with the harm done by the use of personal data, and that the historic way of preventing misuse has been to control access, a measure that is no longer possible in today's nebulous world of data ownership.

Personal data may never be, or have been, within one's possession.

From public cameras and sensors to other people using social media, we simply have no control over who collects data from whom, and we likely never will again.  That raises the question of who owns the data and who controls it.

And while the Electronic Frontier Foundation would complain (again) that this report failed to address metadata (in spite of it equating metadata with actual data in the first few pages), it comes on the eve of a unanimous vote in the House to rein in the National Security Agency, making this a big week for big data privacy advocates.

What is Business Intelligence?

Imagine that you have a small, but rapidly expanding, business that finds itself with multiple ways of storing data.  You start with the best of intentions to have one central database for all of your resources, but you've increasingly found yourself with more software suites requiring very different database management systems; and that's a problem.

It's a problem because while this information is loosely related, there is no built-in way to link the data together.  After all, SQL Server 2000 does not talk with MySQL 3.23, so how do we analyze the information that's contained in so many different types of databases?  We first enable data relationships through a process known as ETL, or Extract, Transform, Load.

This brief video explains how a company might find itself in this situation and how ETL can help it combine data into a central repository known as a data warehouse.  The data warehouse is a snapshot of several databases (like those listed in the video) gathered in one place so that analysts can turn the data into actionable information.  This process of turning data into actionable information is known as Business Intelligence.
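To make the Extract, Transform, Load idea a little more concrete, here is a minimal sketch in Python. It is only an illustration under stated assumptions: the two source systems (`billing` and `web_shop`), their tables, and the `sales` warehouse table are all hypothetical, and in-memory SQLite databases stand in for the mismatched database products a real company would have.

```python
import sqlite3

def extract(conn, query):
    """Extract: pull raw rows out of one source database."""
    return conn.execute(query).fetchall()

def transform(rows, source):
    """Transform: normalize names, coerce amounts to numbers, tag the source."""
    return [(name.strip().title(), float(amount), source) for name, amount in rows]

def load(warehouse, rows):
    """Load: append the cleaned rows to the central warehouse table."""
    warehouse.executemany(
        "INSERT INTO sales (customer, amount, source) VALUES (?, ?, ?)", rows
    )
    warehouse.commit()

# Hypothetical source system 1: billing stores totals as text.
billing = sqlite3.connect(":memory:")
billing.execute("CREATE TABLE invoices (customer TEXT, total TEXT)")
billing.executemany(
    "INSERT INTO invoices VALUES (?, ?)",
    [("  alice smith ", "120.50"), ("BOB JONES", "80")],
)

# Hypothetical source system 2: the web shop uses different column names.
web_shop = sqlite3.connect(":memory:")
web_shop.execute("CREATE TABLE orders (buyer TEXT, price REAL)")
web_shop.executemany("INSERT INTO orders VALUES (?, ?)", [("alice smith", 30.0)])

# The data warehouse: one schema that both sources are mapped onto.
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE sales (customer TEXT, amount REAL, source TEXT)")

load(warehouse, transform(extract(billing, "SELECT customer, total FROM invoices"), "billing"))
load(warehouse, transform(extract(web_shop, "SELECT buyer, price FROM orders"), "web_shop"))

# Once everything lives in one place, a cross-system question is a single query.
totals = dict(
    warehouse.execute("SELECT customer, SUM(amount) FROM sales GROUP BY customer").fetchall()
)
print(totals)
```

The transform step is where most of the real work tends to hide: the same customer is spelled three different ways across the two sources, and only after normalizing can the warehouse add her purchases together.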

Business Intelligence involves any action required to take a business process (like "enroll a student") to analysis (such as "how many lower-income students enrolled in 2014?") to action ("Improve enrollment rates of lower-income students").  These steps vary depending on the size and scope of operations, but can typically be reduced to a simple data process which has been succinctly defined by Google:

  • Prepare
  • Analyze
  • Apply

While not every company will require an ETL process, a data warehouse, or an OLAP cube, every company must prepare its data before it can be analyzed.  Similarly, analysis must take place before knowledge can be accurately applied, and every company will have a different method of analysis.  The single commonality is that Business Intelligence requires preparation, analysis, and application in order to turn data into profit.
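The prepare, analyze, apply cycle can be sketched with the enrollment example from earlier. Everything specific here is an assumption made for illustration: the sample records, the `LOW_INCOME_THRESHOLD` cutoff, the target share, and the function names are hypothetical, not taken from any real enrollment system.

```python
# Hypothetical raw enrollment records, messy the way real exports tend to be.
raw_records = [
    {"student": "A", "year": "2014", "household_income": "18,000"},
    {"student": "B", "year": "2014", "household_income": " 72000 "},
    {"student": "C", "year": "2014", "household_income": ""},  # missing value
    {"student": "D", "year": "2013", "household_income": "15,000"},
    {"student": "E", "year": "2014", "household_income": "21,500"},
]

LOW_INCOME_THRESHOLD = 25000  # assumed cutoff, purely illustrative

def prepare(records):
    """Prepare: clean the fields and drop records that cannot be analyzed."""
    cleaned = []
    for record in records:
        income = record["household_income"].replace(",", "").strip()
        if not income:
            continue  # skip records with no income on file
        cleaned.append(
            {"student": record["student"], "year": int(record["year"]), "income": int(income)}
        )
    return cleaned

def analyze(records, year):
    """Analyze: how many lower-income students enrolled in the given year?"""
    enrolled = [r for r in records if r["year"] == year]
    low_income = [r for r in enrolled if r["income"] < LOW_INCOME_THRESHOLD]
    return len(low_income), len(enrolled)

def apply_findings(low_income, total, target_share):
    """Apply: turn the number into a decision someone can act on."""
    share = low_income / total if total else 0.0
    return "expand outreach" if share < target_share else "maintain current programs"

cleaned = prepare(raw_records)
low_income_count, enrolled_count = analyze(cleaned, 2014)
decision = apply_findings(low_income_count, enrolled_count, target_share=0.5)
print(low_income_count, enrolled_count, decision)
```

Note that the order matters: student C never reaches the analysis because the prepare step could not make sense of the record, which is exactly why preparation has to come first.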

Want to know more?  Check out my book Understanding IT in January 2015 or subscribe to this RSS feed for more updates and teasers during the writing process.  Alternatively, if you have an immediate project that you need help with, please check out my consultation services below.

Writing Updates

As many of you may recall, I've been working on a book, Understanding IT: A Guide for Business Leaders, and I recently decided to publish my graduate thesis under the title Current Trends in Business Intelligence. What you probably don't know is how much progress I've made on these projects.

Understanding IT is a book that aims to give a high-level overview of Information Technology as a science and a career, along with its best practices, from bus architecture to databases.  It is specifically targeted towards small business leaders or newly appointed managers over IT assets and personnel.

Current Trends is my graduate thesis outlining how companies have traditionally acquired data, turned it into knowledge, and used that knowledge to make money; why business intelligence has traditionally been a privilege of the Silicon Valley giants; and why the rise of open source products and MOOCs is making business intelligence more applicable to smaller firms.

I've completed about five of ten chapters of my rough draft of Understanding IT and about a third of the 30-page thesis that Current Trends represents, so I'm feeling relatively confident that I can have a rough manuscript completed by late April or early May.  After that, I'll hand the manuscripts off to an editor (in this case Gabriel Fitzpatrick), come up with something for the cover art, and do a whole bunch of administrative junk associated with self-publishing.

My goal is to be completed sometime around Christmas with a publication date of January 2015!

Why It's Not About Privacy

I've faced some opposition recently based on my views that the Electronic Frontier Foundation did a disservice to their constituents by focusing so much of their efforts on privacy, rather than data ownership.  With that in mind, I pose two ethical scenarios to help illustrate my (and the Guardian's) point that solving the data ownership debate will solve far more than just the privacy debate.

Our laws are focused on data collection, but the existence of data is not the concern; it’s the usage and sharing of data.  In today’s interconnected world, individuals are no longer as concerned about what a given company knows about them, but how it’s used and with whom that information is shared.  These are issues that cannot be solved when we limit the scope of our conversation to privacy, but must be evaluated in the larger discussion of establishing ethical data ownership legislation.

NSA and Graph Theory

Foreign governments are in a state of panic and are looking to "balkanize" the internet, domestic judges are ruling the methods unconstitutional, and lawmakers are looking to turn off the utilities at NSA plants.  However, many people (myself included) take some solace in the fact that we may not be under as much scrutiny as we might think.  We like to assume that if we can't make sense of that much information, then no one can; and the more we know about analyzing data, the more often we jump to that logical fallacy.

I was among those.  Having taken graduate courses in data analytics, I operated under and espoused the belief that the NSA can't possibly analyze all of the information that they're collecting through PRISM, CO-TRAVELER, and Landscaping; and while I was not wrong, it turns out that they don't actually have to.  After a conversation with Andreas Schou, I was introduced to Graph Theory, the methodology that scientists have been using for years to make sense of large amounts of relational data.

Graph Theory is the study of graphs, which are mathematical structures used to model the relationships between objects.  These objects are connected by "edges" which map the objects based on observed or mathematically inferred relationships.  Graph theory is primarily used in the study of discrete mathematics, but can also be used in computer science to represent networks of devices, data, or information flow; in sociology to measure an actor's prestige (Six Degrees of Kevin Bacon); in social network analysis; and in analyzing associations within criminal organizations.

This associative analysis can help intelligence analysts determine the relationship between different objects (a credit card can be linked to a cell phone, which can be linked to a person who has a criminal record).  The problem with PRISM (et al.) is that the intelligence net that is cast is so large that information overload is a serious problem.  How does the NSA, or any large data company (Google, Amazon, Facebook), handle these large data sets?  As we know, any savvy criminal will have more than one phone and almost everyone has more than one e-mail address, credit card, or digital avatar.  The sheer number of objects contained within a graph that attempts to map every transaction, phone call, and relationship will quickly become unmanageable.

Within graph theory, algorithms are relied on to handle and split the complex data into smaller, more manageable graphs.  I'm not a data scientist, so I'm very fuzzy on the specifics of this, but large complex graphs can be split into smaller graphs through algorithmic computation.  These smaller graphs isolate a section of objects that are known to be of interest to law enforcement agencies, and then this data can be analyzed.  For example, if the Los Angeles Police Department picks up a known fugitive and determines that his phone number is 999-4442, then using graph theory the NSA could extract a subgraph of information relationships deemed most relevant to 999-4442, such as that fugitive's credit card, his burner phone, and his favorite pizza parlor's phone number.  Ideally, contained within this subgraph of information will be a link to another, unknown, criminal who may be participating in illegal activities.
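One common way to carve a small graph out of a large one is a breadth-first search that stops after a fixed number of hops from a starting object. The sketch below shows that idea with the 999-4442 example. It is an illustration only: the edge list is invented, the `k_hop_subgraph` helper is hypothetical, and nothing here is claimed to be the method any agency actually uses.

```python
from collections import deque

def k_hop_subgraph(edges, seed, max_hops):
    """Return ({node: hop distance}, [edges kept]) within max_hops of seed."""
    # Build an undirected adjacency map from the edge list.
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)

    # Breadth-first search visits nodes in order of distance from the seed.
    hops = {seed: 0}
    queue = deque([seed])
    while queue:
        node = queue.popleft()
        if hops[node] == max_hops:
            continue  # do not expand past the hop limit
        for neighbor in adjacency.get(node, ()):
            if neighbor not in hops:
                hops[neighbor] = hops[node] + 1
                queue.append(neighbor)

    # Keep only edges whose endpoints both fall inside the hop limit.
    kept = [(a, b) for a, b in edges if a in hops and b in hops]
    return hops, kept

# Invented relationships: objects are phones, cards, and people, not just persons.
edges = [
    ("phone:999-4442", "person:fugitive"),
    ("person:fugitive", "card:1234"),
    ("person:fugitive", "phone:burner"),
    ("phone:burner", "phone:pizza-parlor"),
    ("phone:burner", "person:unknown-associate"),
    ("person:unknown-associate", "phone:associate-cell"),
    ("phone:unrelated-a", "phone:unrelated-b"),
]

hops, subgraph = k_hop_subgraph(edges, "phone:999-4442", max_hops=3)
for node, distance in sorted(hops.items(), key=lambda item: item[1]):
    print(distance, node)
```

In this toy graph the unknown associate shows up exactly three hops from the fugitive's phone number, while the associate's own cell phone (four hops out) and the unrelated pair of phones are left out of the subgraph entirely.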

Mr. Schou posits that through a combination of PRISM, CO-TRAVELER, and Landscaping information, the NSA can create a relational graph of virtually everyone in the world.  The NSA's three degrees methodology for determining from whom they collect information is enough to guarantee that almost everyone is going to end up on the NSA's radar.  Mr. Schou goes on to make the distinction that since the NSA is not actually collecting data against persons in their new metadata surveillance methods, but against phone numbers, credit card numbers, and virtual avatars, the surveillance net grows exponentially.  For example, take a look at how this "three degree" methodology is explained with this slideshow:



Going back to our example involving the Los Angeles Police Department, the fugitive with the phone number "999-4442" will enable the surveillance of 125,000 individuals (50 contacts at each of three hops, or 50 × 50 × 50), and that only counts the data collection associated with PRISM.  When you add in CO-TRAVELER and Landscaping methodologies, that Los Angeles fugitive is going to cast a pretty wide net.  When you consider that some phone numbers are going to be related to significantly more than 50 individuals (have you ever called Microsoft's Tech Support?), the exponential increase from this "three hops" rule quickly becomes staggering.
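The 125,000 figure is just repeated multiplication, which a few lines of Python can check. This is back-of-the-envelope arithmetic under simplifying assumptions: every identifier has the same number of contacts, no two people share a contact, and the 1,000-contact "hub" value for something like a tech support line is made up for illustration.

```python
def reach_per_hop(contacts_per_node, hops):
    """New identifiers reached at each hop, assuming no overlap between contacts."""
    return [contacts_per_node ** hop for hop in range(1, hops + 1)]

# 50 contacts per identifier, three hops out from the fugitive's number.
per_hop = reach_per_hop(50, 3)
third_hop = per_hop[-1]   # people reached at the third hop alone
cumulative = sum(per_hop)  # everyone swept up across all three hops

# A hypothetical hub with 1,000 contacts per identifier at every hop.
hub_third_hop = reach_per_hop(1000, 3)[-1]

print(per_hop, cumulative, hub_third_hop)
```

With 50 contacts per hop the third hop alone reaches 125,000 identifiers; raising the assumed contact count to 1,000 pushes the same three hops to a billion, which is why a single well-connected phone number matters so much.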

So who does this "Three Hops Rule" actually protect?

Reading Survey Results

As part of my graduate studies in Business Analytics at the University of Arkansas (Fayetteville), I was given a homework assignment that gave me my second experience conducting a survey.  My first experience was almost a year ago, when my boss tasked me with surveying and analyzing the retention metrics for a group of eighty-four military members in our squadron.  I did my best with Adobe forms, Excel spreadsheets, and a basic understanding of freshman statistics.  Needless to say, I handled this second survey with a more systematic approach.

Privacy, PRISM, and the flawed 1984 Argument

"Privacy, in other words, involves so many things that it is impossible to reduce them all to one simple idea.  And we need not do so."

Privacy means different things to different people, and it's almost insultingly oppressive to attempt to superimpose your definition of privacy onto others.  For example, my personal belief is that anything conducted in or seen from the public domain should hold no expectation of privacy.  However, others might disagree with that assertion and deem me to be a government stooge.

Who am I to superimpose my privacy expectations onto them?  Unfortunately, in a democratic republic, that's what we do every time we vote.  That's what our leaders do every time they make decisions, and that's what PRISM and CISPA are all about.