A Spark in the Cloud
Complete large processing tasks by harnessing Amazon Web Services EC2, Apache Spark, and the Apache Zeppelin data exploration tool.
Last month I looked at how to use Apache Spark to run compute jobs on clusters of machines [1]. This month, I'm going to take that a step further by looking at how to parallelize the jobs easily and cheaply in the cloud and how to make sense of the data it produces.
Both of these tasks are somewhat interrelated, because if you're going to run your software in the cloud, it's helpful to have a good front end to control it, and this front end should provide a good way of analyzing the data.
Big Data is big business at the moment, and you have lots of options for controlling Spark running in the cloud. However, many of these choices are closed source and could lead to vendor lock-in if you start developing your code in them. To be sure you're not tied down to any one cloud provider and can always run your code on whatever hardware you like, I recommend Apache Zeppelin as a front end. Zeppelin is open source, and it's supported by Amazon's Elastic Map Reduce (EMR), which means it's quick and easy to get started.
Although Zeppelin will work just as well running on your own infrastructure, I'm running it on Amazon because that's the easiest and cheapest way to get access to a large amount of computing power.
To begin, you'll need to set up an account with Amazon Web Services (AWS) [2]. Following this tutorial will cost you a little money, but it needn't be much (as you'll see in a bit). Working out exactly how much something will cost in AWS can be a little complex, and it's made even more difficult because you can get some services – up to a certain amount – for free. I'll try and keep everything simple (and cheap).
You have to pay for the machines you use. Although AWS has a mind-boggling number of different machines [3], rather than going too far into the details, I've found that the m4.xlarge machine works well and, as you scale up, it can be useful to move to the m4.4xlarge or m4.10xlarge. The cost is a little difficult to predict. The basic on-demand price is $0.20 per hour; however, you don't actually need to pay this much, because AWS has a feature called "spot instances" that lets you bid on unused capacity. With spot instances, you set a maximum price you're willing to pay, and if that's more than anyone else is willing to pay, you get the machines.

A further complication is that instances can be in different data centers around the world. Because spot bids are per data center, you can often find cheaper machines by shopping around the different regions. You can see the current minimum price you need to pay for a spot instance in a particular region by going to the AWS website [2]; then, in the box menu in the top left corner, select EC2. Under Instances in the left-hand menu, you'll see Spot Requests, and on the new page, you'll see Pricing History. At the time of writing, m4.xlarge machines are $0.064 per hour in northern Virginia, but $0.023 in London. If you start doing large amounts of data processing on AWS, you'll need to pick a region in which to store your data, and the availability of spot instances can be a key factor in this decision. Obviously, the saving here of $0.177 per hour over the on-demand price isn't huge, but if you're using lots of machines that are each significantly more powerful (and expensive), the saving on spot instances can make a huge difference.
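To get a feel for what spot pricing means in practice, the comparison above can be sketched in a few lines of Python. The prices are the example figures from the text; real spot prices fluctuate constantly, so treat this as arithmetic, not a quote.

```python
# Example hourly prices for an m4.xlarge, taken from the text.
# Spot prices change constantly -- check the Pricing History page for real values.
ON_DEMAND = 0.20                                  # USD per hour, on demand
SPOT = {"us-east-1": 0.064, "eu-west-2": 0.023}   # N. Virginia and London spot prices

def hourly_saving(spot_price, on_demand=ON_DEMAND):
    """Saving per machine per hour when running on a spot instance."""
    return round(on_demand - spot_price, 3)

print(hourly_saving(SPOT["eu-west-2"]))  # London: 0.177
```

Multiply that saving by the number of machines and the hours your job runs, and the case for shopping around regions becomes clear.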
As a word of caution, spot instances are charged per hour, but if someone outbids you, they get the machine instantly, and you get cut off. The Hadoop platform that Spark runs on is quite resilient to this process, as long as you don't lose all the computers in the cluster. I'll look at how to stop this a bit later, but for now, I'll just say that it's best not to bid too close to the current spot price; otherwise, you're liable to lose your machine very quickly.
When you start up an EC2 machine, you get a bare Linux environment, on which you'll need to install a bunch of software. You could write a script that sets up everything for you, but it's far easier to let Amazon organize the work for you with EMR, which will set up everything you need on your machines. It's an additional $0.06 per machine per hour for m4.xlarge machines to use EMR, so the total cost to set up a simple two-machine cluster with EMR is $0.166 per hour.
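The $0.166 figure comes from a simple sum: each machine pays its EC2 price (spot, in this case) plus the EMR surcharge. A quick sketch, using the per-machine prices from the text:

```python
# Total hourly cost of an EMR cluster: each machine pays its EC2 price
# plus the EMR surcharge ($0.06/hour for an m4.xlarge, per the text).
EMR_SURCHARGE = 0.06

def cluster_cost(machines, ec2_price, surcharge=EMR_SURCHARGE):
    """Total USD per hour for a cluster of identical machines."""
    return round(machines * (ec2_price + surcharge), 3)

print(cluster_cost(2, 0.023))  # two m4.xlarge at the London spot price: 0.166
```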
EMR is under Analytics in the AWS box menu. On the EMR page, click on Create Cluster. You'll need to switch to Advanced Options (because you can't select Zeppelin under the quick options), then make sure that you have both Zeppelin and Spark checked, as well as the default options.
Under the Hardware tab, you can select the machines you want. EMR offers three different types of machines: Master, Core, and Task. The general advice for running stable clusters is to have a single Master machine, enough Core machines to run the job in the worst case, then as many Task machines as you want. With the Master and Core machines on an uninterruptable tariff (e.g., on-demand) and the Task machines as spot instances, your job won't be killed halfway through if someone outbids you, but it will finish sooner (and cheaper) if spot instances are available. However, for simple tests, I usually use spot instances for all my machines because I'm a cheapskate. In the web browser, you can delete the entire row for Task machines, then set the number and spot prices for the Master and Core machines.
On the next screen, turn off Logging and Termination Protection (both of these are more useful when you have a pre-defined job ready to run). Give your cluster a useful name and hit Next, then Create Cluster to set up your machines. It takes a few minutes for the machines to be set up and have all the software installed.
For security reasons, the cluster will be set up with everything locked down by a firewall, and you need to add a rule that allows you in. On the EMR Cluster screen (Figure 1), you should see Security groups for Master followed by a link. Clicking that link will take you to a new screen. Check the master security group and select Action | Edit inbound rules. Create a rule for your IP address (you can find this by visiting the What Is My IP Address site [4]) followed by /32, for ports 0-65535. Back on the EMR screen, you can now click on the Zeppelin link to access the web UI (Figure 2).
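If you prefer scripting the console, the same rule can be built programmatically. The sketch below only constructs the /32 CIDR and a rule dictionary in the shape that boto3's security group calls accept; it deliberately doesn't talk to AWS, so you can see exactly the values you'd otherwise type into the console.

```python
import ipaddress

def inbound_rule(my_ip):
    """Build an all-ports TCP ingress rule for a single IP address.

    The returned dict matches the IpPermissions shape used by boto3's
    authorize_security_group_ingress(); here it's just illustrative data.
    """
    cidr = f"{ipaddress.ip_address(my_ip)}/32"  # validates the address, appends /32
    return {
        "IpProtocol": "tcp",
        "FromPort": 0,
        "ToPort": 65535,
        "IpRanges": [{"CidrIp": cidr}],
    }

# 203.0.113.7 is a documentation address; substitute your own public IP.
print(inbound_rule("203.0.113.7")["IpRanges"][0]["CidrIp"])
```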
From here, you can run your Spark code in your cluster from the web browser. Zeppelin code is organized into Notebooks, each of which contains "paragraphs" of code (the language is set at the start of the paragraph with %pyspark for Python or %sql for SparkSQL). The results of SQL queries are automatically transformed into charts. You can see an example of how to get started at Notebook | Zeppelin Tutorial.
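As a hypothetical illustration of how the two paragraph types work together (the DataFrame contents and the prices table name are invented for the example), a notebook might contain a %pyspark paragraph that registers a table, followed by a %sql paragraph that queries it:

```
%pyspark
# First paragraph: build a small DataFrame and register it as a table
df = spark.createDataFrame([("London", 0.023), ("N. Virginia", 0.064)],
                           ["region", "spot_price"])
df.createOrReplaceTempView("prices")

%sql
-- Second, separate paragraph: results are rendered as a table or chart
SELECT region, spot_price FROM prices ORDER BY spot_price
```

In Zeppelin, each block above would live in its own paragraph; the %sql output can be flipped between table, bar, and pie views with the buttons under the result.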
Don't forget to terminate your EMR instances when you're finished, or you'll continue to be charged.
Infos
- "Tutorials – Apache Spark" by Ben Everard, Linux Pro Magazine, issue 202, September 2017, pg. 89, http://www.linuxpromagazine.com/Issues/2017/202/Tutorials-Apache-Spark
- AWS: http://aws.amazon.com
- AWS machines: http://www.ec2instances.info
- What Is My IP Address: http://whatismyip.com