A Spark in the Cloud
Complete large processing tasks by harnessing Amazon Web Services EC2, Apache Spark, and the Apache Zeppelin data exploration tool.
Last month I looked at how to use Apache Spark to run compute jobs on clusters of machines [1]. This month, I'm going to take that a step further by looking at how to parallelize the jobs easily and cheaply in the cloud and how to make sense of the data it produces.
Both of these tasks are somewhat interrelated, because if you're going to run your software in the cloud, it's helpful to have a good front end to control it, and this front end should provide a good way of analyzing the data.
Big Data is big business at the moment, and you have lots of options for controlling Spark running in the cloud. However, many of these choices are closed source and could lead to vendor lock-in if you start developing your code in them. To be sure you're not tied to any one cloud provider and can always run your code on whatever hardware you like, I recommend Apache Zeppelin as a front end. Zeppelin is open source, and it's supported by Amazon's Elastic MapReduce (EMR), which means it's quick and easy to get started.
Although Zeppelin will work just as well running on your own infrastructure, I'm running it on Amazon because that's the easiest and cheapest way to get access to a large amount of computing power.
To begin, you'll need to set up an account with Amazon Web Services (AWS) [2]. Following this tutorial will cost you a little money, but it needn't be much (as you'll see in a bit). Working out exactly how much something will cost in AWS can be a little complex, and it's made even more difficult because you can get some services – up to a certain amount – for free. I'll try and keep everything simple (and cheap).
You have to pay for the machines you use. AWS offers a mind-boggling number of machine types [3]; rather than going too far into the details, I've found that the m4.xlarge works well, and, as you scale up, the m4.4xlarge or m4.10xlarge can be useful. The cost is a little difficult to predict. The basic on-demand price is $0.20 per hour; however, you don't actually need to pay this much, because AWS has a feature called "spot instances" that lets you bid on unused capacity. With spot instances, you set a maximum price you're willing to pay, and if that's more than anyone else is willing to pay, you get the machines.
A further complication is that instances can be in different data centers around the world. Because spot bids are per data center, you can often find cheaper machines by shopping around the different regions. To see the current minimum price for a spot instance in a particular region, go to the AWS website [2], select EC2 from the box menu in the top left corner, click Spot Requests under Instances in the left-hand menu, and then click Pricing History on the new page. At the time of writing, m4.xlarge machines are $0.064 per hour in northern Virginia, but $0.023 in London. If you start doing large amounts of data processing on AWS, you'll need to pick a region in which to store your data, and the availability of cheap spot instances can be a key factor in that decision. Obviously, the saving here of $0.177 per hour over the on-demand price isn't huge, but if you're using lots of machines that are each significantly more powerful (and expensive), the saving on spot instances can make a huge difference.
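To get a feel for how these savings add up, here's a quick back-of-the-envelope sketch. The prices are the article's snapshot rates and will have changed since, and the eight-machine, ten-hour cluster is an invented example:

```python
# Back-of-the-envelope savings from the example prices quoted above
# (these rates are a snapshot and will have changed since).
on_demand = 0.20       # m4.xlarge on-demand price, $/hour
spot_london = 0.023    # example m4.xlarge spot price, London

saving_per_hour = on_demand - spot_london   # the $0.177 mentioned in the text

# Scaled up to a hypothetical 8-machine cluster running for 10 hours:
machines, hours = 8, 10
print(f"Saving per machine-hour: ${saving_per_hour:.3f}")
print(f"Saving for {machines} machines over {hours} hours: "
      f"${saving_per_hour * machines * hours:.2f}")
```

The per-hour difference looks trivial, but it multiplies by every machine and every hour the cluster runs, which is why spot pricing matters for larger jobs.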
As a word of caution, spot instances are charged per hour, but if someone outbids you, they get the machine instantly, and you get cut off. The Hadoop platform that Spark runs on is quite resilient to this process, as long as you don't lose all the computers in the cluster. I'll look at how to stop this a bit later, but for now, I'll just say that it's best not to bid too close to the current spot price; otherwise, you're liable to lose your machine very quickly.
When you start up an EC2 machine, you get a bare Linux environment, on which you'll need to install a bunch of software. You could write a script that does all this yourself, but it's far easier to let Amazon organize the work for you with EMR, which sets up everything you need on your machines. Using EMR costs an additional $0.06 per machine per hour for m4.xlarge machines, so the total cost of a simple two-machine cluster with EMR is $0.166 per hour.
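To see where that $0.166 figure comes from, the arithmetic is simply the per-machine spot price plus the EMR surcharge, times the number of machines (using the example London spot price from earlier):

```python
# Hourly cost of the simple two-machine test cluster described above,
# using the article's example London spot price for m4.xlarge.
spot_price = 0.023     # $/hour per machine (example spot price)
emr_surcharge = 0.06   # EMR fee, $/hour per m4.xlarge
machines = 2

total_per_hour = machines * (spot_price + emr_surcharge)
print(f"${total_per_hour:.3f} per hour")
```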
EMR is under Analytics in the AWS box menu. On the EMR page, click on Create Cluster. You'll need to switch to Advanced Options (because you can't select Zeppelin under the quick options), then make sure that you have both Zeppelin and Spark checked, as well as the default options.
Under the Hardware tab, you can select the machines you want. EMR offers three different types of machines: Master, Core, and Task. The general advice for running stable clusters is to have a single Master machine, enough Core machines to run the job in the worst case, then as many Task machines as you want. With the Master and Core machines on an uninterruptable tariff (e.g., on-demand) and the Task machines as spot instances, your job won't be killed halfway through if someone outbids you, but it will finish sooner (and cheaper) if spot instances are available. However, for simple tests, I usually use spot instances for all my machines because I'm a cheapskate. In the web browser, you can delete the entire row for Task machines, then set the number and spot prices for the Master and Core machines.
On the next screen, turn off Logging and Termination Protection (both of these are more useful when you have a pre-defined job ready to run). Give your cluster a useful name and hit Next, then Create Cluster to set up your machines. It takes a few minutes for the machines to be set up and have all the software installed.
For security reasons, the cluster will be set up with everything locked down by a firewall, and you need to add a rule that allows you in. On the EMR Cluster screen (Figure 1), you should see Security groups for Master followed by a link. Clicking that link will take you to a new screen. Check the master security group and select Action | Edit inbound rules. Create a rule for your IP address (you can find this by visiting the What Is My IP Address site [4]) followed by /32, for ports 0-65535. Back on the EMR screen, you can now click on the Zeppelin link to access the web UI (Figure 2).
From here, you can run your Spark code in your cluster from the web browser. Zeppelin code is organized into notebooks, each of which contains "paragraphs" of code (the language is set at the start of the paragraph with %pyspark for Python or %sql for Spark SQL). The results of SQL queries are automatically transformed into charts. You can see an example of how to get started at Notebook | Zeppelin Tutorial.
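As a rough sketch of how this fits together, here is what a pair of Zeppelin paragraphs might look like. The DataFrame contents and the temporary view name numbers are invented examples; sc is the SparkContext that Zeppelin creates for you, and in the notebook each % marker starts its own paragraph (they're shown together here only for compactness):

```
%pyspark
# Zeppelin provides sc (the SparkContext) automatically
df = sc.parallelize([(i, i * i) for i in range(10)]).toDF(["n", "square"])
df.createOrReplaceTempView("numbers")

%sql
-- the result of this query is rendered as a table or chart
SELECT n, square FROM numbers
```

Running the %sql paragraph gives you Zeppelin's chart controls, so you can switch between a table, bar chart, or line chart without writing any plotting code.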
Don't forget to terminate your EMR instances when you're finished, or you'll continue to be charged.
Infos
- "Tutorials – Apache Spark" by Ben Everard, Linux Pro Magazine, issue 202, September 2017, pg. 89, http://www.linuxpromagazine.com/Issues/2017/202/Tutorials-Apache-Spark
- AWS: http://aws.amazon.com
- AWS machines: http://www.ec2instances.info
- What Is My IP Address: http://whatismyip.com