Of lakes and sparks – How Hadoop 2 got it right
Apache Spark
Apache Spark [7] is an exciting project that goes beyond what Hadoop MapReduce offers. First, Spark is a great in-memory parallel processing tool. It is a fast, general-purpose cluster computing system that provides high-level APIs in Java, Scala, and Python, along with an optimized engine that supports general execution graphs. It also supports a rich set of higher-level tools, including Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Spark Streaming for stream processing.
Second, Spark offers more than 80 high-level operators that make it easy to build parallel applications. It can be used interactively from the Scala and Python shells. Spark can be run in a standalone cluster mode, on EC2, or on Hadoop YARN or Apache Mesos. It can read from HDFS, HBase, Cassandra, and any Hadoop data source.
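To give a feel for what chaining such operators looks like, here is a minimal sketch in plain Python; it is not Spark itself. The MiniDataset class and its method names are hypothetical stand-ins that run on a local list. In PySpark, the corresponding RDD operators are flatMap, map, filter, and reduceByKey, and the work would be spread across the cluster rather than done in one process.

```python
from collections import defaultdict
from functools import reduce
from itertools import chain


class MiniDataset:
    """Hypothetical local stand-in for a distributed dataset (not a Spark class)."""

    def __init__(self, items):
        self.items = list(items)

    def flat_map(self, fn):
        # Apply fn to every element and flatten the resulting iterables.
        return MiniDataset(chain.from_iterable(fn(x) for x in self.items))

    def map(self, fn):
        # Transform every element one-to-one.
        return MiniDataset(fn(x) for x in self.items)

    def filter(self, pred):
        # Keep only the elements for which pred returns True.
        return MiniDataset(x for x in self.items if pred(x))

    def reduce_by_key(self, fn):
        # Group (key, value) pairs by key, then fold each group's values with fn.
        groups = defaultdict(list)
        for key, value in self.items:
            groups[key].append(value)
        return MiniDataset((k, reduce(fn, vs)) for k, vs in groups.items())

    def collect(self):
        # Return the elements as an ordinary list.
        return list(self.items)


# Made-up sample input; a real job would read from HDFS or another data source.
lines = ["the data lake", "the spark and the lake"]

counts = (
    MiniDataset(lines)
    .flat_map(str.split)                 # lines -> words
    .map(lambda word: (word, 1))         # words -> (word, 1) pairs
    .reduce_by_key(lambda a, b: a + b)   # sum the 1s per word
    .collect()
)
word_counts = dict(counts)
print(word_counts)
```

Each operator returns a new dataset, so the whole word count reads as one chain from input to result, which is the style that tends to make such programs shorter than hand-written map and reduce classes.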
Finally, many programs run up to 100 times faster than Hadoop batch MapReduce jobs when using in-memory processing and about 10 times faster on disk, similar to the results obtained with Tez. It is also worth mentioning that Spark is part of the next-generation Stinger Project to improve Hadoop Hive SQL. Then again, if you take the hype in the article mentioned in the first paragraph at face value, you might never learn that Hadoop version 2 is no longer a one-trick pony and is now the preferred platform for new tools designed to take advantage of the growing data lake.
Beyond trying it out, there is not much more to say about Spark and Hadoop. Spark does seem a bit easier to use than writing a Java application against the Hadoop MapReduce APIs, but the real advantage of the platform is that developers can decide which approach works best and then head out onto the data lake.
More to Come
Many people are surprised to learn how far Hadoop 2 application development has come: It includes many applications that were not possible with Hadoop version 1, as well as many examples of how to write an application that runs under Hadoop YARN. Much of the current confusion that surrounds Hadoop exists because Hadoop has transitioned from an application to a far-reaching platform on which applications like Spark can be implemented. The value of an open Big Data platform cannot be overstated. Hadoop version 2 is the operating system for data lake clusters and a milestone in the evolution of data analysis.
Infos
[1] Jackson, J. "Hadoop successor sparks a data analysis evolution." IDG News Service, December 5, 2014, http://www.computerworld.com/article/2856063/enterprise-software/hadoop-successor-sparks-a-data-analysis-evolution.html
[2] Big Data characteristics: http://en.wikipedia.org/wiki/Big_data
[3] Big Data surprises: http://www.sisense.com/blog/big-data-surprises
[4] Rowstron, A., D. Narayanan, A. Donnelly, G. O'Shea, and A. Douglas. "Nobody ever got fired for using Hadoop on a cluster." In: 1st International Workshop on Hot Topics in Cloud Data Processing (Bern, Switzerland, Association for Computing Machinery, 2012), http://research.microsoft.com/pubs/163083/hotcbp12%20final.pdf
[5] Apache Tez: http://tez.apache.org/install.html
[6] Stinger project: http://hortonworks.com/labs/stinger/
[7] Apache Spark: https://spark.apache.org