A Spark in the Cloud

Article from Issue 203/2017

Complete large processing tasks by harnessing Amazon Web Services EC2, Apache Spark, and the Apache Zeppelin data exploration tool.

Last month, I looked at how to use Apache Spark to run compute jobs on clusters of machines [1]. This month, I'm going to take that a step further by looking at how to parallelize those jobs easily and cheaply in the cloud and how to make sense of the data they produce.

These two tasks are interrelated: if you're going to run your software in the cloud, it helps to have a good front end to control it, and that front end should also provide a good way of analyzing the data.

Big Data is big business at the moment, and you have lots of options for controlling Spark running in the cloud. However, many of these choices are closed source and could lead to vendor lock-in if you start developing your code in them. To be sure you're not tied to any one cloud provider and can always run your code on whatever hardware you like, I recommend Apache Zeppelin as a front end. Zeppelin is open source, and it's supported by Amazon's Elastic MapReduce (EMR), which means it's quick and easy to get started.

Although Zeppelin will work just as well running on your own infrastructure, I'm running it on Amazon because that's the easiest and cheapest way to get access to a large amount of computing power.

To begin, you'll need to set up an account with Amazon Web Services (AWS) [2]. Following this tutorial will cost you a little money, but it needn't be much (as you'll see in a bit). Working out exactly how much something will cost in AWS can be a little complex, and it's made even more difficult because you can get some services – up to a certain amount – for free. I'll try to keep everything simple (and cheap).

You have to pay for the machines you use. AWS offers a mind-boggling number of machine types [3], but rather than going too far into the details, I've found that the m4.xlarge works well and that, as you scale up, the m4.4xlarge or m4.10xlarge can be useful. The cost is a little difficult to predict. The basic on-demand price for an m4.xlarge is $0.20 per hour; however, you don't actually need to pay this much, because AWS has a feature called "spot instances" that lets you bid on unused capacity. With spot instances, you set a maximum price you're willing to pay, and if that's more than anyone else is willing to pay, you get the machines.

A further complication is that instances can be in different data centers around the world. Because spot bids are per data center, you can often find cheaper machines by shopping around the different regions. To see the current minimum price you need to pay for a spot instance in a particular region, go to the AWS website [2] and, in the box menu in the top left corner, select EC2. Under Instances in the left-hand menu, you'll see Spot Requests, and on the new page, you'll see Pricing History. At the time of writing, m4.xlarge machines are $0.064 per hour in northern Virginia but $0.023 in London. If you start doing large amounts of data processing using AWS, you'll need to pick a region in which to store your data, and the availability of spot instances can be a key factor in this decision.

Obviously, the saving here of $0.177 per hour over the on-demand price isn't huge, but if you're using lots of machines that are each significantly more powerful (and expensive), the saving on spot instances can make a huge difference.
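
If you'd rather check spot prices from a script than click through the console, the following sketch does the same lookup with the boto3 Python library (which isn't otherwise needed for this tutorial, so treat it as an optional extra). The region codes are assumptions corresponding to northern Virginia (us-east-1) and London (eu-west-2), and the script expects AWS credentials to be configured on your machine.

import boto3

def latest_spot_prices(instance_type="m4.xlarge",
                       regions=("us-east-1", "eu-west-2")):
    """Print the cheapest of the most recent Linux spot prices per region."""
    for region in regions:
        ec2 = boto3.client("ec2", region_name=region)
        history = ec2.describe_spot_price_history(
            InstanceTypes=[instance_type],
            ProductDescriptions=["Linux/UNIX"],
            MaxResults=20,  # only the latest few data points per availability zone
        )
        prices = [float(p["SpotPrice"]) for p in history["SpotPriceHistory"]]
        if prices:
            print("%s: %s spot from $%.3f/hr" % (region, instance_type, min(prices)))

latest_spot_prices()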

As a word of caution, spot instances are charged per hour, but if someone outbids you, they get the machine immediately, and you get cut off. The Hadoop platform that Spark runs on is quite resilient to this, as long as you don't lose all the computers in the cluster. I'll look at how to guard against this a bit later, but for now, I'll just say that it's best not to bid too close to the current spot price; otherwise, you're liable to lose your machines very quickly.

When you start up an EC2 machine, you get a bare Linux environment on which you'll need to install a bunch of software. You could write a script that sets everything up for you, but it's far easier to let Amazon organize the work with EMR, which will install everything you need on your machines. EMR costs an additional $0.06 per machine per hour for m4.xlarge machines, so at the London spot price of $0.023, a simple two-machine cluster comes to ($0.023 + $0.06) x 2 = $0.166 per hour.

EMR is under Analytics in the AWS box menu. On the EMR page, click on Create Cluster. You'll need to switch to Advanced Options (because you can't select Zeppelin under the quick options), then make sure that you have both Zeppelin and Spark checked, as well as the default options.

Under the Hardware tab, you can select the machines you want. EMR offers three different types of machines: Master, Core, and Task. The general advice for running stable clusters is to have a single Master machine, enough Core machines to run the job in the worst case, and then as many Task machines as you want. With the Master and Core machines on an uninterruptible tariff (e.g., on-demand) and the Task machines as spot instances, your job won't be killed halfway through if someone outbids you, but it will finish sooner (and more cheaply) if spot instances are available. However, for simple tests, I usually use spot instances for all my machines because I'm a cheapskate. In the web browser, you can delete the entire row for Task machines, then set the number and spot prices for the Master and Core machines.

On the next screen, turn off Logging and Termination Protection (both of these are more useful when you have a pre-defined job ready to run). Give your cluster a useful name and hit Next, then Create Cluster to set up your machines. It takes a few minutes for the machines to be set up and have all the software installed.
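
If you prefer to script the cluster creation rather than clicking through the console, the same setup looks roughly like the sketch below using boto3 (the console steps above are all you actually need). The release label, instance counts, bid price, and key pair name are assumptions you'll want to adjust to your own account and the current spot market.

import boto3

emr = boto3.client("emr", region_name="eu-west-2")  # London, as in the pricing example

response = emr.run_job_flow(
    Name="zeppelin-spark-tutorial",
    ReleaseLabel="emr-5.8.0",  # an EMR release current at the time of writing
    Applications=[{"Name": "Spark"}, {"Name": "Zeppelin"}],
    Instances={
        "InstanceGroups": [
            {   # one Master on demand so the cluster survives spot outbidding
                "Name": "Master", "InstanceRole": "MASTER",
                "InstanceType": "m4.xlarge", "InstanceCount": 1,
                "Market": "ON_DEMAND",
            },
            {   # enough Core machines to finish the job in the worst case
                "Name": "Core", "InstanceRole": "CORE",
                "InstanceType": "m4.xlarge", "InstanceCount": 1,
                "Market": "ON_DEMAND",
            },
            {   # Task machines as spot instances: cheap extra capacity
                "Name": "Task", "InstanceRole": "TASK",
                "InstanceType": "m4.xlarge", "InstanceCount": 2,
                "Market": "SPOT", "BidPrice": "0.05",
            },
        ],
        "Ec2KeyName": "my-keypair",           # hypothetical key pair name
        "KeepJobFlowAliveWhenNoSteps": True,  # keep the cluster up for interactive Zeppelin use
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print("Cluster starting:", response["JobFlowId"])

The important part is the three instance groups, which mirror the Master/Core/Task split described above.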

For security reasons, the cluster is set up with everything locked down by a firewall, and you need to add a rule that lets you in. On the EMR Cluster screen (Figure 1), you should see Security groups for Master followed by a link. Clicking that link takes you to a new screen. Check the master security group and select Actions | Edit inbound rules. Create a rule for your IP address (which you can find by visiting the What Is My IP Address site [4]) followed by /32, covering ports 0-65535. Back on the EMR screen, you can now click on the Zeppelin link to access the web UI (Figure 2).
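
If you want to script that firewall change instead, it boils down to a single security group rule. Here's a minimal boto3 sketch; the security group ID and IP address are placeholders you need to replace with the master group ID shown on the EMR screen and your own public address.

import boto3

ec2 = boto3.client("ec2", region_name="eu-west-2")

MY_IP = "203.0.113.7"               # replace with the address What Is My IP Address reports
MASTER_SG = "sg-0123456789abcdef0"  # replace with the master security group ID

ec2.authorize_security_group_ingress(
    GroupId=MASTER_SG,
    IpPermissions=[{
        "IpProtocol": "tcp",
        "FromPort": 0,
        "ToPort": 65535,
        "IpRanges": [{"CidrIp": MY_IP + "/32",
                      "Description": "My workstation only"}],
    }],
)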

Figure 1: The AWS interface makes it easy to get your cluster up and running.
Figure 2: Zeppelin is an easy-to-use interface for running massively parallel jobs via Spark.

From here, you can run your Spark code in your cluster from the web browser. Zeppelin code is organized into notebooks, each of which contains "paragraphs" of code (the language is set at the start of each paragraph with %pyspark for Python or %sql for SparkSQL). The results of SQL queries are automatically transformed into charts. You can see an example of how to get started under Notebook | Zeppelin Tutorial.
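
To give you a feel for how the paragraphs fit together, here's a pair of example paragraphs you could paste into a new note. The table name and data are invented purely for illustration, and the spark session used below is provided by Zeppelin.

%pyspark
# First paragraph: build a tiny DataFrame and expose it to SparkSQL
from pyspark.sql import Row

rows = [Row(region="us-east-1", price=0.064),
        Row(region="eu-west-2", price=0.023)]
spark.createDataFrame(rows).createOrReplaceTempView("spot_prices")

%sql
-- Second paragraph: query the view; Zeppelin turns the result into a chart
SELECT region, price FROM spot_prices ORDER BY price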

Don't forget to terminate your EMR instances when you're finished, or you'll continue to be charged.
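
You can do this from the EMR page in the console, or with a small boto3 sketch like the one below, which lists any clusters still active in the region and shuts them down (double-check the names it prints before relying on it).

import boto3

emr = boto3.client("emr", region_name="eu-west-2")

active = emr.list_clusters(ClusterStates=["STARTING", "RUNNING", "WAITING"])
for cluster in active["Clusters"]:
    print("Terminating", cluster["Id"], cluster["Name"])
    emr.terminate_job_flows(JobFlowIds=[cluster["Id"]])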

Infos

  1. "Tutorials – Apache Spark" by Ben Evarard, Linux Pro Magazine, isue 202, September 2017, pg. 89, http://www.linuxpromagazine.com/Issues/2017/202/Tutorials-Apache-Spark
  2. AWS: http://aws.amazon.com
  3. AWS machines: http://www.ec2instances.info
  4. What Is My IP Address: http://whatismyip.com
