A Spark in the Cloud
A Spark in the Cloud
Complete large processing tasks by harnessing Amazon Web Services EC2, Apache Spark, and the Apache Zeppelin data exploration tool.
Last month I looked at how to use Apache Spark to run compute jobs on clusters of machines [1]. This month, I'm going to take that a step further by looking at how to parallelize the jobs easily and cheaply in the cloud and how to make sense of the data it produces.
Both of these tasks are somewhat interrelated, because if you're going to run your software in the cloud, it's helpful to have a good front end to control it, and this front end should provide a good way of analyzing the data.
Big Data is big business at the moment, and you have lots of options for controlling Spark running in the cloud. However, many of these choices are closed source and could lead to vendor lock-in if you start developing your code in them. To be sure you're not tied down to any one cloud provider and can always run your code on whatever hardware you like. I recommend Apache Zeppelin as a front end. Zepplin is open source, and it's supported by Amazon's Elastic Map Reduce (EMR), which means it's quick and easy to get started.
[...]
Buy this article as PDF
(incl. VAT)
Buy Linux Magazine
Subscribe to our Linux Newsletters
Find Linux and Open Source Jobs
Subscribe to our ADMIN Newsletters
Support Our Work
Linux Magazine content is made possible with support from readers like you. Please consider contributing when you’ve found an article to be beneficial.
News
-
Kubuntu Focus Goes Ultra
The Kubuntu Focus team has upped the performance ante of its M2 and Zr laptops with the latest, greatest CPUs from Intel.
-
Linux Gamers May Soon See Less Mouse Lag in KDE Plasma
Gamers using KDE’s Plasma desktop have been suffering from a slight input delay in mouse movement that could lead to getting fragged.
-
Three Lines of Code Improve Linux Storage Performance
A developer changed three lines of code, giving Linux storage performance a 5% bump.
-
AUR Hit Again with Malicious Packages
Once again the Arch User Repository is plagued by a high volume of malicious packages.
-
Alpine Linux 3.24 Features Fresh Desktops and a Newer Kernel
If you're a fan of Alpine Linux, it's time to upgrade because the latest version has been released with KDE Plasma 6.6, Gnome 50, and Linux kernel 6.18 LTS.
-
EU Open Source Strategy Plays Key Role in Tech Sovereignty Package
Comprehensive measures adopted by the European Commission aim to reduce dependency on non-EU countries.
-
Linux Foundation Report Indicates AI Driving Tech Hiring
Within growing security and skills gaps, AI has been found to be a positive driving force behind tech hiring trends in Europe.
-
United Nations Open Source Portal Goes Live
A new open source portal seeks to coordinate and scale open source efforts across the United Nations system.
-
KDE Linux Drops AUR
KDE Linux developers have dropped the Arch User Repository from the build pipeline due to security concerns; other distributions should consider doing the same.
-
California May Exempt Linux from Its Age-Verification Law
After backlash from the Linux community, California may be backing off on its promise to force all operating systems to verify age, but one platform may still have to comply.
