An Introduction to Hadoop

Hadoop grew out of Nutch, an open-source web search engine project started by Doug Cutting and Mike Cafarella. In 2006 the distributed storage and processing code was split out of Nutch into a separate Apache Lucene subproject and named Hadoop.  It remains an open-source solution and was built on the MapReduce model developed by Google.  As a result of its MapReduce heritage, Hadoop performs filtering and sorting (the "map" step) and summarization (the "reduce" step).  These steps run in parallel across multiple computers (also referred to as "nodes") that make up a cluster, which is controlled by a "master node."
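To make the map and reduce steps a little more concrete, here is a minimal word-count sketch in plain Python. It does not use Hadoop's actual API; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are made up for illustration, and everything runs on a single machine.

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one line of input."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Group all emitted values under their key, between the two phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Reduce: summarize all values for one key into a single result."""
    return key, sum(values)

def word_count(lines):
    """Run map, shuffle, and reduce in sequence over a list of text lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())

counts = word_count(["big data big cluster", "data node"])
```

On a real cluster, each call to the map and reduce functions could run on a different node; the grouping in between is what lets every reducer see all the values for its key.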

During the MapReduce process, the master node takes the data and a user-defined problem and distributes smaller subsets of both to nodes within the cluster, so that multiple servers can work on pieces of the original problem simultaneously.  Once a computer has finished working on its sub-problem, it passes the result back to the master node, which compiles the results into a summary before presenting its findings to the user.
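The split, distribute, and summarize flow described above can be sketched roughly as follows. This is a toy, single-machine stand-in: threads play the role of worker nodes, and the names `split`, `worker`, and `master` are hypothetical rather than anything Hadoop provides.

```python
from concurrent.futures import ThreadPoolExecutor

def split(data, parts):
    """Master step: cut the input into roughly equal subsets, one per worker."""
    size = -(-len(data) // parts)  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]

def worker(subset):
    """Worker step: solve the sub-problem (here, a partial sum and a count)."""
    return sum(subset), len(subset)

def master(data, workers=3):
    """Distribute subsets, collect every partial result, then summarize."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(worker, split(data, workers)))
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count

average = master([4, 8, 15, 16, 23, 42])
```

Each worker only ever sees its own subset, and the master only combines small partial results, which is the property that lets the approach scale to data that would not fit on one machine.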

In MapReduce v2, this work is coordinated by a Java daemon known as the ResourceManager, which runs on the master node, receives client requests, and assigns smaller tasks to child nodes within the cluster via their NodeManagers.  The process is monitored through “heartbeats” and tolerates the failure of one or more child nodes.  If a child node fails, its tasks are rerun on another, known-good node.
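Here is a rough sketch of how heartbeat-based failure detection might work. It is not Hadoop's implementation: the `ClusterMonitor` class, the 10-second timeout, and the node and task names are all assumptions chosen for illustration, and time is passed in explicitly so the example is deterministic.

```python
HEARTBEAT_TIMEOUT = 10  # seconds of silence before a node is presumed dead (assumed value)

class ClusterMonitor:
    """Toy stand-in for a master that tracks child nodes via heartbeats."""

    def __init__(self):
        self.last_seen = {}    # node name -> time of last heartbeat
        self.assignments = {}  # task name -> node name

    def heartbeat(self, node, now):
        """Record that a node reported in at time `now`."""
        self.last_seen[node] = now

    def assign(self, task, node):
        self.assignments[task] = node

    def healthy_nodes(self, now):
        """Nodes whose most recent heartbeat is within the timeout."""
        return sorted(n for n, seen in self.last_seen.items()
                      if now - seen <= HEARTBEAT_TIMEOUT)

    def reassign_failed(self, now):
        """Move tasks away from silent nodes onto healthy ones, round-robin."""
        healthy = self.healthy_nodes(now)
        if not healthy:
            raise RuntimeError("no healthy nodes left")
        moved = {}
        for task, node in self.assignments.items():
            if node not in healthy:
                moved[task] = healthy[len(moved) % len(healthy)]
        self.assignments.update(moved)
        return moved

monitor = ClusterMonitor()
monitor.heartbeat("node-a", now=0)
monitor.heartbeat("node-b", now=0)
monitor.assign("task-1", "node-a")
monitor.assign("task-2", "node-b")
monitor.heartbeat("node-a", now=12)      # node-b stays silent
moved = monitor.reassign_failed(now=15)  # task-2 is handed to node-a
```

The key idea is that the master never needs to ask whether a node is alive; a node that stops sending heartbeats is simply treated as failed once the timeout passes.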

This intra-cluster coordination relies on ZooKeeper, a coordination service that keeps an in-memory, hierarchical key-value store (loosely comparable to a NoSQL database) used to keep track of the nodes and their status.  Additionally, MapReduce v2 can tolerate the failure of a master node, which removes the single point of failure.

Due to this distributed computing model, Hadoop can process very large amounts of data quickly by spreading the work across many machines. Once the data is analyzed, it can be stored in HBase, an open-source, non-relational (NoSQL) database that runs on top of Hadoop.  Data is commonly manipulated with Pig, whose scripting language, Pig Latin, is mostly human-readable (much like COBOL was) but offers limited functionality. Data typically isn’t kept in HBase for long because of its limited security and storage tooling (like backups), so it’s often used as just a “pre-processing” step before the data is shipped off to a data warehouse.

The data warehousing solution in the Hadoop ecosystem is known as Hive. It uses a SQL-like language called HiveQL, which covers both DDL and DML statements, and offers a more robust solution for data retention and analysis.

Whereas the previous Business Intelligence model required an expansive Extract, Transform, Load (ETL) process in which data was drawn into the data warehouse from multiple disparate transaction systems, the Hadoop model allows the ETL process to take place more quickly on a Hadoop cluster. This simplification is due to Hadoop's tool set (Pig, HBase, ZooKeeper), which is designed specifically for handling "big" data.

I recommend messing around with Hadoop through Hortonworks, which offers a sandbox virtual machine environment, before moving on to the more technical details of installing a single-node cluster on a home computer or hiring a virtual consultant to help you with the process!
