Apache Spark
FAQ
Spread your processing load across hundreds of machines as easily as running it locally.
Q: Apache Spark? I've wanted to set fire to my Apache web server more than a few times – usually when I'm elbows-deep in a config file that just refuses to work as I want it to. Is that what it's for?
A: I've been there, too, but no. The web server commonly known as Apache is officially called the Apache HTTP Server. The Apache Software Foundation manages hundreds of projects, and only a few of them are related to web servers. Apache Spark is a programming platform (Figure 1).
Q: But there are hundreds of programming platforms already. What does Apache Spark have that others don't?
A: To understand Spark, you have to understand the problem that it's designed to solve: processing Big Data.
Q: Ah, Big Data! The buzzword du jour of the computing industry. Let's start with the obvious question: Just how big does Big Data have to be to need Spark? Gigabytes? Terabytes? Petabytes … er … jibblywhatsitbytes?
A: With Big Data, it's better not to think of the size in terms of absolute numbers. For our purposes, we'll say that data becomes big once there's too much to process on a single machine at the speed you need it. All Big Data technologies are based around the idea that you need to coordinate your processing across a group of machines to get the throughput you need.
Q: What's so hard about that? You just shuffle the data around to different machines, run the same command on them, and then you're done. In fact, give me a minute; I think I can put together a single command to SCP data to other machines and then use SSH to run the commands on them.
A: You don't even need to write a complex command if you just want to load balance data processing across various machines; the GNU Parallel tool has support for that baked in. Although this approach can work really well for some simple tasks (e.g., recompressing a large number of images), it can very quickly become complex or even impossible – for example, when processing one piece of data depends on the results of processing other pieces.
Q: It sounds like you're just making up problems now. What real-world issues does this help with?
A: Probably the best example of a problem that Spark solves is machine learning.
Q: As in artificial intelligence?
A: Yep. Machine learning can work in many ways, but consider this one: You push data through an algorithm that tries to intuit something about the data, while at the same time the algorithm is learning from the data it sees. The two activities, learning and processing, happen simultaneously. For this to work, there has to be a link between the separate computers doing the processing (so that they all learn the same model), but at the same time, you want to distribute the processing across them as much as possible.
Q: This sounds a bit like black magic. How does Spark manage to keep a consistent, yet changing model across multiple machines?
A: The concept that sits at the heart of the Spark platform is Resilient Distributed Datasets (RDDs). An RDD is an immutable collection of data. That means that once it's created, an RDD doesn't change. Instead, transformations that happen to an RDD create a new RDD. The most basic use of RDDs is something you may be familiar with: MapReduce.
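To make that concrete, here's a minimal sketch in PySpark (Spark's Python API) running locally; the application name is just a label we made up. Notice that each transformation leaves the original RDD untouched and returns a new one:

from pyspark import SparkContext

sc = SparkContext("local[*]", "rdd-demo")

# Create an RDD from an in-memory collection
numbers = sc.parallelize([1, 2, 3, 4, 5])

# Transformations never modify 'numbers'; each one returns a new RDD
doubled = numbers.map(lambda n: n * 2)
evens = doubled.filter(lambda n: n % 4 == 0)

# Nothing is computed until an action such as collect() runs
print(evens.collect())  # [4, 8]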
Q: Ah, yes. I've heard of that. I can't remember what it is though!
A: Very simply, MapReduce is another paradigm for processing Big Data. It goes through every item in your dataset and runs a function on it (this is the map), and then it combines all these results into a single output (this is the reduce). For example, if you had a lot of images and you wanted to balance their brightness and then create a montage of them, the map stage would go through each image in turn balancing its brightness, and the reduce stage would bring the balanced images together into a montage.
Spark is heavily inspired by MapReduce and can perhaps be thought of as a way to expand the MapReduce concept to include more features. In the above example, there would be three RDDs. The first would be your source images, the second would be created by a transform that balances the brightness of your images, and the third would be created by the transform that brings them all together to make the montage.
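In PySpark, the map and the reduce are one line each. Here's a sketch of the montage pipeline with strings standing in for images: upper() stands in for brightness balancing, and string concatenation stands in for assembling the montage.

from pyspark import SparkContext

sc = SparkContext("local[*]", "montage-sketch")

images = sc.parallelize(["img_a", "img_b", "img_c"])  # RDD 1: source images
balanced = images.map(lambda s: s.upper())            # RDD 2: the map step
montage = balanced.reduce(lambda a, b: a + "|" + b)   # the reduce step

print(montage)  # e.g., IMG_A|IMG_B|IMG_C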
Q: Ah, OK. How does this work with artificial intelligence again?
A: Well, there are two answers to that. The first is complex and doesn't fit in a two-page FAQ; the second is: "don't worry about that, someone's done it for you."
Q: Ah, there's a library to use?
A: To call Spark's MLlib a machine learning library seems to undersell it, but because that's what its name stands for, we can't really argue with it. Essentially, it is a framework that comes complete with all the common machine learning algorithms ready to go. Although it does require you to do some programming, 10 lines of simple code is enough to train and run a machine learning model on a vast array of data split over many machines (Figure 2). It really does make massively parallel machine learning possible for almost anyone with a little programming experience and a few machines.
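As a taste of what those 10 lines might look like, here's a sketch using MLlib's DataFrame-based API. The toy data and application name are our own inventions; a real job would load its training data from a cluster filesystem instead:

from pyspark.sql import SparkSession
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.appName("mllib-sketch").getOrCreate()

# Toy labeled data; in practice you'd load this from HDFS
train = spark.createDataFrame([
    (0.0, Vectors.dense([0.0, 1.1])),
    (1.0, Vectors.dense([2.0, 1.0])),
    (0.0, Vectors.dense([0.1, 1.2])),
    (1.0, Vectors.dense([1.9, 0.8])),
], ["label", "features"])

# Train a logistic regression model and apply it to the data
model = LogisticRegression(maxIter=10).fit(train)
model.transform(train).select("label", "prediction").show()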
Q: OK, so that's machine learning. Are there any other tasks that Spark really excels at?
A: There's nothing else that Spark makes quite as easy as machine learning, but one other area that is gaining popularity is stream data processing. This is where you have a constant flow of data coming in and you want to process it as soon as it arrives (and typically send it on to some time-series database). The Spark Structured Streaming framework makes it easy to perform just this sort of processing.
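For a flavor of what that looks like, here's a minimal word-count sketch with Structured Streaming, assuming text arrives on a local socket (you could feed one with nc -lk 9999); the host, port, and application name are just placeholders:

from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split

spark = SparkSession.builder.appName("stream-sketch").getOrCreate()

# Treat lines arriving on a socket as an unbounded table
lines = (spark.readStream.format("socket")
         .option("host", "localhost").option("port", 9999).load())

# Split each line into words and keep a running count per word
words = lines.select(explode(split(lines.value, " ")).alias("word"))
counts = words.groupBy("word").count()

# Print the updated counts to the console as each batch arrives
query = (counts.writeStream.outputMode("complete")
         .format("console").start())
query.awaitTermination()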
It's wrong to think of Spark as just for machine learning and streaming, though. These are just two areas that happen to have frameworks built on Spark. The tool itself is general purpose and can be useful for almost any distributed computing task.
Q: So, all I need to create my own streaming machine learning system is a couple of machines, a Linux distro, and Spark?
A: Not quite. You also need some way of sharing the data among the machines. The Hadoop Distributed File System (HDFS) is the most popular way of doing this. It provides a way of storing more data than any one machine can hold and processing it efficiently without moving it between computers more than is necessary. HDFS also provides resilience in case one or more machines break. After all, the more machines you've got, the higher the chance that one breaks, and you don't want your cluster to go down just because one machine has some issues.
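From Spark's point of view, using HDFS is just a matter of pointing at an hdfs:// URI. In this sketch, the namenode host, port, and path are hypothetical:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hdfs-sketch").getOrCreate()

# Read files straight from the distributed filesystem; Spark schedules
# work on the machines that already hold each block of data
logs = spark.read.text("hdfs://namenode:9000/data/logs/*.log")
print(logs.count())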
Q: Right. I'm off to build a business empire on computers that are more intelligent than I am.
A: Good luck.