Of lakes and sparks – How Hadoop 2 got it right
Apache Spark
Apache Spark [7] is an exciting project that extends what Hadoop MapReduce can do. First, Spark is an in-memory parallel processing tool: a fast, general-purpose cluster computing system that provides high-level APIs in Java, Scala, and Python, along with an optimized engine that supports general execution graphs. It also supports a rich set of higher level tools, including Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Spark Streaming for stream processing.
Second, Spark offers more than 80 high-level operators that make it easy to build parallel applications, and it can be used interactively from the Scala and Python shells. Spark can run in standalone cluster mode, on EC2, on Hadoop YARN, or on Apache Mesos. It can read from HDFS, HBase, Cassandra, and any other Hadoop data source.
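To give a feel for what chaining such operators looks like, the following is a minimal plain-Python sketch of the classic word count. It does not use Spark at all: the helpers `flat_map` and `reduce_by_key` are hypothetical stand-ins written for this illustration, named after the Spark operators `flatMap` and `reduceByKey`, and the sample input is made up.

```python
from itertools import chain


def flat_map(func, records):
    """Apply func to each record and flatten the results (cf. Spark's flatMap)."""
    return chain.from_iterable(func(record) for record in records)


def reduce_by_key(func, pairs):
    """Merge all values that share a key using func (cf. Spark's reduceByKey)."""
    merged = {}
    for key, value in pairs:
        merged[key] = func(merged[key], value) if key in merged else value
    return merged


def word_count(lines):
    """Count words with a flatMap -> map -> reduceByKey style chain."""
    words = flat_map(lambda line: line.lower().split(), lines)
    pairs = map(lambda word: (word, 1), words)
    return reduce_by_key(lambda a, b: a + b, pairs)


# Hypothetical sample input; a real job would read lines from HDFS or similar.
sample_lines = ["the data lake", "the spark on the lake"]
counts = word_count(sample_lines)
print(counts)
```

In the Python shell that ships with Spark, roughly the same chain would be written against a SparkContext (conventionally named `sc`) as `sc.textFile(path).flatMap(lambda l: l.split()).map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)`, where `path` is a placeholder for an input location. The difference is that Spark distributes each step across the cluster instead of running it in a single process.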
Finally, many programs run up to 100 times faster than Hadoop batch MapReduce jobs when using in-memory processing and about 10 times faster on disk, similar to the results obtained with Tez. It is also worth mentioning that Spark is part of the next-generation Stinger project to improve Hadoop Hive SQL. Then again, if you believe the hype in the article mentioned in the first paragraph, you might never learn that Hadoop version 2 is no longer a one-trick pony and is now the preferred platform for new tools designed to take advantage of the growing data lake.
Other than trying it out, there is not much more to mention about Spark and Hadoop. Spark does seem a bit easier to use than writing a Java application against the Hadoop MapReduce APIs, but the real advantage of the platform is that the developer can decide which approach works best and then head out onto the data lake.
More to Come
Many people are surprised to learn the extent of Hadoop 2 application development, which includes many applications that were not possible with Hadoop version 1, as well as many examples of how to write an application that runs under Hadoop YARN. Much of the current confusion surrounding Hadoop exists because it has transitioned from a single application to a far-reaching platform on which applications like Spark can be implemented. The value of an open Big Data platform cannot be overstated. Hadoop version 2 is the operating system for data lake clusters and a milestone in the evolution of data analysis.
Infos
- [1] Jackson, J. "Hadoop successor sparks a data analysis evolution." IDG News Service, December 5, 2014: http://www.computerworld.com/article/2856063/enterprise-software/hadoop-successor-sparks-a-data-analysis-evolution.html
- [2] Big Data characteristics: http://en.wikipedia.org/wiki/Big_data
- [3] Big Data surprises: http://www.sisense.com/blog/big-data-surprises
- [4] Rowstron, A., D. Narayanan, A. Donnelly, G. O'Shea, and A. Douglas. "Nobody ever got fired for using Hadoop on a cluster." In: 1st International Workshop on Hot Topics in Cloud Data Processing (Bern, Switzerland, Association for Computing Machinery, 2012): http://research.microsoft.com/pubs/163083/hotcbp12%20final.pdf
- [5] Apache Tez: http://tez.apache.org/install.html
- [6] Stinger project: http://hortonworks.com/labs/stinger/
- [7] Apache Spark: https://spark.apache.org