Hadoop Distribution Comparison
The three Hadoop distributions discussed here are Apache Hadoop, MapR, and Cloudera.
All three share the same goals: performance, scalability, reliability, and availability. They also share the same advantages: massive storage; great computing power; flexibility (data can be stored first and processed whenever needed, rather than preprocessed before storage as in traditional relational databases, which also makes it easy to bring in new data sources such as social media and email conversations); fault tolerance (if one node fails, jobs keep running on other nodes, because the data was replicated to those nodes when it was first stored, so the computation does not fail); low cost (data is stored on commodity hardware); and scalability (adding nodes adds storage and requires little administration).
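The fault-tolerance argument above can be illustrated with a small sketch. It is a toy model, not HDFS's actual placement policy: the function names, the six-node cluster, and the random placement are assumptions made for illustration. Each block is copied to three distinct nodes, one node is then marked as failed, and the code checks which blocks can still be read.

```python
import random

def place_blocks(num_blocks, nodes, replication=3, seed=0):
    """Assign each block to `replication` distinct nodes (toy random placement)."""
    rng = random.Random(seed)
    return {block: set(rng.sample(nodes, replication)) for block in range(num_blocks)}

def readable_blocks(placement, failed):
    """Return the blocks that still have at least one replica on a live node."""
    return {block for block, holders in placement.items() if holders - failed}

# Hypothetical six-node cluster holding 20 blocks, three replicas each.
nodes = [f"node{i}" for i in range(6)]
placement = place_blocks(num_blocks=20, nodes=nodes, replication=3)

# Fail one node: every block keeps at least two live replicas.
survivors = readable_blocks(placement, failed={"node2"})
print(len(survivors), "of", len(placement), "blocks still readable")
```

With three replicas per block, data only becomes unreadable once all three nodes holding the same block are down at the same time, which is why a single node failure does not stop the job.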
Apache Hadoop is the standard Hadoop ...
HDFS and MapReduce are still rough around the edges, and both still depend on a single master, which requires care and may limit scaling. More importantly, HDFS, which was designed for high-capacity storage, cannot efficiently support random reads of small files.
MapR and Cloudera both originate from Apache Hadoop. They add new functionality and/or improve the code base to overcome the issues of the Apache release, provide additional value to customers, and focus more on reliability, support, and completeness.
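One reason commonly given for why a single master limits scaling with small files is that the master keeps an in-memory entry for every file and every block. The back-of-the-envelope sketch below compares the same 1 TiB of data stored as large files and as small files. The figure of about 150 bytes per entry is a widely quoted rule of thumb, and the 128 MiB block size is an assumed setting; neither number comes from the sources cited here, and the function names are hypothetical.

```python
BYTES_PER_OBJECT = 150        # assumed rule-of-thumb memory cost per file or block entry
BLOCK_SIZE = 128 * 1024**2    # assumed block size (128 MiB)

def ceil_div(a, b):
    """Integer ceiling division."""
    return (a + b - 1) // b

def namespace_objects(total_bytes, file_size):
    """Count file entries plus block entries for data split into equal-size files."""
    files = ceil_div(total_bytes, file_size)
    blocks_per_file = max(1, ceil_div(file_size, BLOCK_SIZE))
    return files + files * blocks_per_file

def master_memory_bytes(total_bytes, file_size):
    """Estimate the master's memory use under the assumed per-entry cost."""
    return namespace_objects(total_bytes, file_size) * BYTES_PER_OBJECT

one_tib = 1024**4
large = master_memory_bytes(one_tib, file_size=1024**3)      # 1 GiB files
small = master_memory_bytes(one_tib, file_size=100 * 1024)   # 100 KiB files
print(f"1 GiB files:   {large / 1024**2:.2f} MiB of master memory")
print(f"100 KiB files: {small / 1024**2:.2f} MiB of master memory")
```

Under these assumptions the small-file layout needs over a thousand times more master memory for the same amount of data, because every small file still costs one file entry and one block entry.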
The MapR distribution goes a step further by replacing HDFS with its own proprietary file system, MapRFS. MapRFS brings enterprise-grade features to Hadoop, enabling more efficient data management, greater reliability and, most importantly, ease of use. It aims to sustain deployments of up to 10,000 nodes with no single point of failure, which is ensured by a distributed NameNode. MapR allows for storing 1-10 exabytes of data and supports NFS and random read-write semantics (Altoros, 2013).
Cloudera has the most powerful Hadoop deployment and administration tools, designed for managing clusters of unlimited size. It is based on Apache Hadoop, and its main improvement is the proprietary Cloudera Management Suite, which automates the installation process and provides other conveniences for users, such as reduced deployment time and a real-time display of node counts. On the downside, because it still uses HDFS, it shares the HDFS disadvantages of Apache Hadoop.
Moccio, S.V., & Grim, P. A. (2012). Big Data Engine. Available from
Altoros Systems Inc. (2013). Hadoop Distributions: Evaluating Cloudera, Hortonworks, and MapR in Micro-benchmarks and Real-world Applications. Retrieved from
Apache Software Foundation. (2013). HDFS Architecture Guide. Retrieved from Hadoop: