Hadoop is a generic processing framework designed to execute queries and batch read operations against massive datasets across clusters of computers. It lets organizations scan through large volumes of data (first loaded into the Hadoop Distributed File System, HDFS) and produce results that are meaningful to them. Simply put, Hadoop is the key open source technology that provides a Big Data engine.
Hadoop operates on massive datasets by horizontally scaling the processing across very large numbers of servers, through an approach called MapReduce. This contrasts with vertical scaling, which requires a single powerful server to process the data in a timely manner.
Hundreds or thousands of small, inexpensive commodity servers have enough combined power for the job, provided the processing can be scaled horizontally and executed in parallel. Using the MapReduce approach, Hadoop splits up a problem, sends the sub-problems to different servers, and lets each server solve its sub-problem in parallel. It then merges the sub-problem solutions and writes the result out to files, which may in turn be used as inputs to additional MapReduce steps.
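The split, solve-in-parallel, and merge steps described above can be sketched in a few lines of Python. This is only a local illustration of the idea, not Hadoop's API: the word-count problem, the function names (`split_problem`, `map_chunk`, `reduce_counts`, `word_count`), and the use of threads in place of separate servers are all assumptions made for the example.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def split_problem(lines, num_workers):
    """Split the input into sub-problems, one per (hypothetical) worker."""
    return [lines[i::num_workers] for i in range(num_workers)]


def map_chunk(chunk):
    """Map step: each worker counts the words in its own chunk independently."""
    counts = Counter()
    for line in chunk:
        counts.update(line.lower().split())
    return counts


def reduce_counts(partials):
    """Reduce step: merge the per-worker results into one solution."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


def word_count(lines, num_workers=4):
    chunks = split_problem(lines, num_workers)
    # Threads stand in for separate servers in this local sketch.
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partials = list(pool.map(map_chunk, chunks))
    return reduce_counts(partials)


if __name__ == "__main__":
    sample = ["big data big cluster", "data node", "big node"]
    print(word_count(sample, num_workers=2))
```

In a real cluster the chunks would be blocks of files stored in HDFS, and the merged result would be written back to files rather than returned in memory, so that a further MapReduce step could consume it.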
Although Hadoop provides a platform for data storage and parallel processing, the real value comes from add-on subprojects (ZooKeeper, Pig, Hive, Lucene, HBase, etc.), which add functionality and new capabilities to the platform. Most implementations of a Hadoop platform will include at least some of these subprojects. For example, an organization might choose HDFS as the primary distributed file system, HBase as the database to store billions of rows of data, and MapReduce as the framework for distributed processing.
A number of companies are emerging with different plans to help organizations use Hadoop: extending support, providing professional services, producing tools that work alongside Hadoop and make it easier to use, or providing a complete Hadoop-based platform that addresses many enterprise needs. It is worthwhile to look at a few of the players in this segment.
IBM took the open source Big Data technology, Hadoop, and extended it into an enterprise-ready Big Data platform.
IBM delivers a Hadoop platform that is hardened for enterprise use, with deep consideration for high availability, scalability, performance, ease of use, and the other qualities one normally expects of a solution deployed in a production environment.
InfoSphere BigInsights also flattens the time-to-value curve associated with Big Data analytics by providing development and runtime environments for developers to build advanced analytical applications, and tools for business users to analyze the data.
(Cloudera's Distribution for Hadoop)
Cloudera delivers an integrated Apache Hadoop-based stack containing all the components needed for production use, tested and packaged to work together. It incorporates only software from open source projects, with no forks or proprietary underpinnings, and comes with Cloudera Manager, an end-to-end management application for Apache Hadoop that includes features such as proactive health checks and intelligent log management.
MapR’s M5 makes Hadoop more reliable (full data protection, no single points of failure), more affordable, more manageable, better performing, and significantly easier to use.
To put this in perspective, Hadoop should not be considered a replacement for relational databases or data warehouses, but something that will coexist with and complement traditional data stores to provide richer capabilities to the organization. While traditional warehouses are ideal for analyzing structured data from various systems, the sheer volume of unstructured and semi-structured data makes it sensible to use the cheap cycles of server farms to transform masses of unstructured data with low information density into smaller amounts of dense structured data, which is then loaded into a traditional database for further analysis.
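As a rough illustration of that division of labor, the sketch below condenses raw log lines into a handful of summary rows and loads them into a relational table. Everything specific here is an assumption made for the example: the log format, the `daily_events` table, the function names, and the use of an in-memory SQLite database as a stand-in for the warehouse. In practice the summarizing step would run as a Hadoop job over data in HDFS rather than as a local function.

```python
import re
import sqlite3
from collections import Counter

# Hypothetical log format: "<date>T<time> <LEVEL> <free text>"
LOG_PATTERN = re.compile(r"^(\d{4}-\d{2}-\d{2})T\S+\s+(INFO|WARN|ERROR)\b")


def summarize(raw_lines):
    """Reduce many low-density log lines to a few (day, level, count) rows."""
    counts = Counter()
    for line in raw_lines:
        match = LOG_PATTERN.match(line)
        if match:  # lines that do not fit the pattern are skipped
            counts[(match.group(1), match.group(2))] += 1
    return sorted((day, level, n) for (day, level), n in counts.items())


def load_into_warehouse(rows, conn):
    """Load the dense, structured rows into a relational table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS daily_events "
        "(day TEXT, level TEXT, total INTEGER)"
    )
    conn.executemany("INSERT INTO daily_events VALUES (?, ?, ?)", rows)
    conn.commit()


if __name__ == "__main__":
    raw = [
        "2012-03-01T10:00:01 INFO user login ok",
        "2012-03-01T10:00:07 ERROR disk quota exceeded on node 7",
        "garbage line without structure",
        "2012-03-01T10:02:11 ERROR timeout talking to node 3",
    ]
    conn = sqlite3.connect(":memory:")
    load_into_warehouse(summarize(raw), conn)
    print(conn.execute("SELECT * FROM daily_events ORDER BY level").fetchall())
```

Once the summary rows are in the database, the usual SQL tooling can take over for further analysis, which is the complementary role the paragraph above describes.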
To conclude, open source Hadoop offers enterprises a great deal of potential to harness data (structured, semi-structured, or entirely unstructured) that was until now difficult to manage and analyze. Hadoop is also gaining wider acceptance among vendors, who are bringing out various Hadoop-based stacks aimed at providing a better user experience.