Of lakes and sparks – How Hadoop 2 got it right

Apache Spark

Apache Spark [7] is an exciting project that extends what Hadoop MapReduce can do. First, Spark is a fast, general-purpose cluster computing system built around in-memory parallel processing. It provides high-level APIs in Java, Scala, and Python and an optimized engine that supports general execution graphs. It also supports a rich set of higher level tools, including Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Spark Streaming.

Second, Spark offers more than 80 high-level operators that make it easy to build parallel applications. It can be used interactively from the Scala and Python shells. Spark can be run in a standalone cluster mode, on EC2, or on Hadoop YARN or Apache Mesos. It can read from HDFS, HBase, Cassandra, and any Hadoop data source.

Finally, many programs run up to 100 times faster than Hadoop batch MapReduce jobs when using in-memory processing and about 10 times faster on disk, similar to the results obtained with Tez. It is also worth mentioning that Spark is part of the next-generation Stinger project [6] to improve Hadoop Hive SQL. Then again, if you believe the hype in the article mentioned in the first paragraph, you might not learn that Hadoop version 2 is no longer a one-trick pony and is now the preferred platform for new tools designed to take advantage of the growing data lake.

Other than trying it out, there is not much more to mention about Spark and Hadoop. Spark does seem a bit easier to use than writing a Java application with the Hadoop MapReduce APIs, but the real advantage is that the developer can decide which approach works best and then head out onto the data lake.
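To give a feel for the operator style, here is a minimal word-count sketch using Spark's Python API. It is an illustration under stated assumptions rather than a reference implementation: it assumes PySpark is installed, the function names (tokenize, word_counts, run) are made up for this example, and input.txt is a placeholder path that could just as well be an hdfs:// URI on a Hadoop cluster.

```python
import re
from operator import add


def tokenize(line):
    """Split one line of text into lowercase words."""
    return re.findall(r"[a-z']+", line.lower())


def word_counts(lines):
    """Chain three Spark operators on an RDD of text lines.

    flatMap emits every word, map pairs each word with a 1,
    and reduceByKey sums the 1s for each distinct word.
    """
    return lines.flatMap(tokenize).map(lambda word: (word, 1)).reduceByKey(add)


def run(path="input.txt", master="local[*]"):
    """Count words in a text file and return the ten most frequent.

    Assumptions: PySpark is installed, and `path` is a placeholder;
    it could equally be an hdfs:// URI on a Hadoop cluster.
    """
    # Imported here so the helpers above load even without Spark installed.
    from pyspark import SparkContext

    sc = SparkContext(master, "WordCountSketch")
    try:
        counts = word_counts(sc.textFile(path))
        # A negated key makes takeOrdered return the largest counts first.
        return counts.takeOrdered(10, key=lambda pair: -pair[1])
    finally:
        sc.stop()
```

Calling run() on a machine with PySpark returns a list of (word, count) pairs, and the same three operators can be typed line by line into the pyspark shell. A comparable Java MapReduce program typically needs separate mapper and reducer classes plus a driver to configure the job.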

More to Come

Many people are surprised to learn the extent of Hadoop 2 application development. It includes many applications that were not possible with Hadoop version 1, and there are many examples of how to write an application that will run under Hadoop YARN. Much of the current confusion that surrounds Hadoop exists because Hadoop has transitioned from an application to a far-reaching platform on which applications like Spark can be implemented. The value of an open Big Data platform cannot be overstated. Hadoop version 2 is the operating system for data lake clusters and a milestone in the evolution of data analysis.


  1. Jackson, J. "Hadoop successor sparks a data analysis evolution." IDG News Service, December 5, 2014, http://www.computerworld.com/article/2856063/enterprise-software/hadoop-successor-sparks-a-data-analysis-evolution.html
  2. Big Data characteristics: http://en.wikipedia.org/wiki/Big_data
  3. Big Data surprises: http://www.sisense.com/blog/big-data-surprises
  4. Rowstron, A., D. Narayanan, A. Donnelly, G. O'Shea, and A. Douglas. "Nobody ever got fired for using Hadoop on a cluster." In: 1st International Workshop on Hot Topics in Cloud Data Processing (Bern, Switzerland, Association for Computing Machinery, 2012), http://research.microsoft.com/pubs/163083/hotcbp12%20final.pdf
  5. Apache Tez: http://tez.apache.org/install.html
  6. Stinger project: http://hortonworks.com/labs/stinger/
  7. Apache Spark: https://spark.apache.org

The Author

Douglas Eadline @thedeadline has been writing about and developing cluster computing for more than 20 years.
