Highly Parallel Programming with Apache Spark
Churn through lots of data with cluster computing on Apache's Spark platform.
As a society, we're creating more data than ever before. We're monitoring everything from the planet's weather to the performance of our computers, and we're storing all this information. But how do you process all this data? On a single machine, you can get a few terabytes of disk space and a few hundred gigabytes of memory (at least, you can if your pockets are deep enough), but how do you churn through a petabyte of raw ones and zeros? Basically, you're going to need more than one computer, and you're going to look for a method of running your programs on many machines at the same time: Apache Spark [1].
Before you run off and buy a rack of servers, slow down! We're going to start by introducing Spark on a single machine. Once you've mastered the basics, you can scale up.
Spark is a data processing engine that is often used with Hadoop for managing large amounts of data in a highly distributed manner. If you move forward with Spark, you're probably going to end up with a complete Hadoop setup; however, that's also getting ahead of ourselves. We can start Spark as a standalone service on a single computer.
The first thing you need is Java v1.8. On Ubuntu and derivatives, enter:
apt install openjdk-8-jre
On other systems, take a look at your package manager for details.
The second thing you need is Spark itself, which you can download from the Apache website [2]. I used the latest version at the time of writing (2.1.1, pre-built for Hadoop 2.7). Download this version and extract the archive.
Open a terminal and navigate to the folder that you've just extracted. From there, you should be able to run the following command:
bin/pyspark
This command will start Spark and drop you into a PySpark shell. Spark isn't just a programming environment; it's a way of scheduling jobs across a cluster of machines, and that's true even here, where our cluster consists of a single machine. There's a web interface to this server on port 4040, so point your web browser to localhost:4040 to see a list of what's running (Figure 1). At the moment, it won't show much, but this interface is useful for keeping an eye on your Spark server as you add more machines and jobs.
Spark isn't specific to Python. In fact, Scala is the default language, but there are tools for many languages, and Python is my language of preference. PySpark is a tool for submitting Python jobs to the Spark cluster.
Take a look at a simple PySpark program:
import re

words = sc.textFile("/usr/share/dict/words")

def check_regex(word):
    return re.match('^[bean][bean][bean][bean]$', word)

out = words.filter(check_regex)
out.take(5)
Resilient Distributed Datasets (RDDs) are the core concept of Spark. They're basically data structures that are split across all the machines in the cluster so that any operation can be easily parallelized across all the machines. Each RDD is also immutable: any operation on an RDD creates a new one while leaving the old one intact, and because Spark remembers how each RDD was built, it can recompute lost pieces if a machine fails, which is where the "resilient" comes from.
Our really simple code here takes the words file from your machine (if it's not at this location, you can download a words file from the Linux Voice site [3] and point your program at the downloaded copy) and builds an RDD, with each item in the RDD being created from a line in the file. Essentially, what we have now is an array containing one entry for each word in the English language.
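If you want to check that the file loaded as expected, a couple of quick RDD actions will do it. This is just a sanity-check sketch run in the same PySpark session; the exact count depends on the words file installed on your system:

words.count()   # number of lines, i.e., number of words
words.first()   # the first word in the file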
RDDs aren't regular arrays, though. They have methods that allow you to work on them. In this example, I use the filter method, which takes a function as its argument; the function is run on every item in the RDD, and a new RDD is created containing only the items for which it returns a true value. I also use the take method, which gets us a sample from the RDD; in this case, the argument 5 means I want five items from the RDD.
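The same filter can also be written more compactly with a lambda. This is just an equivalent sketch in the same session, using the {4} regex shorthand for four repeated character classes:

out = words.filter(lambda word: re.match('^[bean]{4}$', word))
out.take(5)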
The sc referenced in the program is the SparkContext, which serves as the entry point for Spark functionality. The PySpark shell creates it for you automatically; if you're using Spark from outside the shell, you'll need to import SparkContext and set it up yourself. Take a look at the documentation for the details, as the setup differs a bit depending on how you're running Spark.
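As a rough sketch of what that looks like (the app name here is just a placeholder, and the exact configuration depends on how you deploy Spark), a standalone script might create its own context in local mode like this:

from pyspark import SparkConf, SparkContext

# Run locally, using all available cores; the app name is arbitrary
conf = SparkConf().setAppName("WordFilter").setMaster("local[*]")
sc = SparkContext(conf=conf)

words = sc.textFile("/usr/share/dict/words")
print(words.count())

sc.stop()

You would run a script like this with bin/spark-submit rather than typing it into the PySpark shell.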
Back in the PySpark shell, our simple program returns up to five four-letter words made up only of the letters b, e, a, and n (Figure 2). It is, admittedly, not a particularly impressive program, but it gets us started with Spark.
Another core Spark concept is DataFrames. DataFrames are very similar to RDDs except for the fact that they have a schema. In the previous example, each entry in the RDD has a single element, a word, but this doesn't have to be the case. RDDs and DataFrames often contain complex sets of data, and setting them against a schema allows you to make more structured queries.
A schema basically describes the table we want the data to fit into. As you'll see in a moment, we define the names of the table's columns as a Python list (in this case, ['word']).
Take a look at the following:
words_df = words.map(lambda x: (x, )).toDF(['word'])
out = words_df.filter("word like '%ing'")
out.take(5)
This code follows on from the previous code block and needs to run in the same PySpark session. The first line does two things: first, it uses the map method of the RDD to wrap each word up in a tuple. This matters because DataFrames need schemas, and a schema can't be applied to a single bare value; each entry has to be a list, tuple, or Row. The second part of the first line builds a new DataFrame from this mapped RDD, with a schema containing a single column called word.
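If you want to confirm the schema Spark has given the new DataFrame, printSchema will show it; the commented output below is roughly what you should expect to see in the same session:

words_df.printSchema()
# root
#  |-- word: string (nullable = true)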
Again we use the filter method to pull out some data we're interested in. This time, however, we're not passing a function but a string. Database users amongst you will recognize the syntax as the same as the where clause in an SQL statement. We're asking for every row where the column word is like '%ing'; the percent sign matches any text in SQL, so this means every row where the word ends in ing.
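The same query can be written in a couple of other ways. This is just a sketch of the equivalents, assuming the same session (in Spark 2.x, the PySpark shell also provides a SparkSession called spark):

# Column-expression equivalent of the string filter
out = words_df.filter(words_df.word.endswith('ing'))
out.take(5)

# Or register the DataFrame as a temporary view and use full SQL
words_df.createOrReplaceTempView("words")
spark.sql("SELECT word FROM words WHERE word LIKE '%ing'").take(5)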
Now that you've run a few jobs in Spark, you can go back to your browser at localhost:4040, where you should see the completed jobs that have run on your Spark cluster.
This has been a whirlwind tour of the basics of Apache Spark. Hopefully, you now understand what Spark is and how you can program with it. See the Apache Spark website for examples, documentation, and other information on using Spark (Figure 3). The real advantage of Spark comes when you're dealing with massive datasets. You can create DataFrames on the fly and query them efficiently across large clusters of computers. It's not unusual to have Spark clusters with well over a terabyte of RAM, where huge datasets can sit and be processed without ever hitting the disk, so really powerful analyses can take place incredibly quickly.
Infos

[1] Apache Spark: http://spark.apache.org
[2] Download Spark: http://spark.apache.org/downloads.html
[3] Download words file: http://www.linuxvoice.com/words