Apache Spark
FAQ
Spread your processing load across hundreds of machines as easily as running it locally.
Q: Apache Spark? I've wanted to set fire to my Apache web server more than a few times – usually when I'm elbows-deep in a config file that just refuses to work as I want it to. Is that what it's for?
A: I've been there, too, but no. The web server commonly known as Apache is officially called the Apache HTTP Server. The Apache Software Foundation manages hundreds of projects, and only a few of them are related to web servers. Apache Spark is a programming platform (Figure 1).
Q: But, there are hundreds of programming platforms already. What does Apache Spark have that others don't?
A: To understand Spark, you have to understand the problem that it's designed to solve: Processing Big Data.
Q: Ah, Big Data! The buzzword du jour of the computing industry. Let's start with the obvious question: Just how big does Big Data have to be to need Spark? Gigabytes? Terabytes? Petabytes … er … jibblywhatsitbytes?
A: With Big Data, it's better not to think of the size in terms of absolute numbers. For our purposes, we'll say that data becomes big once there's too much to process on a single machine at the speed you need it. All Big Data technologies are based around the idea that you need to coordinate your processing across a group of machines to get the throughput you need.
Q: What's so hard about that? You just shuffle the data around to different machines, run the same command on them, and then you're done. In fact, give me a minute; I think I can put together a single command to SCP data to other machines and then use SSH to run the commands on them.
A: You don't even need to write a complex command if you just want to load balance data processing across various machines; the GNU Parallel tool has support for that baked in. Although this approach can work really well for some simple cases (e.g., if you want to recompress a large number of images), it can very quickly become complex or even impossible, for example, if processing one part of the data depends on the results of processing another part.
Q: It sounds like you're just making up problems now. What real-world issues does this help with?
A: Probably the best example of a problem that Spark solves is machine learning.
Q: As in artificial intelligence?
A: Yep. Machine learning can work in many ways, but consider this one: You push data through an algorithm that tries to intuit something about the data and, at the same time, learns from the data it sees. Learning and processing are happening simultaneously. For this to work, there has to be a link between the separate computers doing the processing (so that they all learn in the same way), yet they still want to spread the processing out as much as possible.
Q: This sounds a bit like black magic. How does Spark manage to keep a consistent, yet changing model across multiple machines?
A: The concept that sits at the heart of the Spark platform is Resilient Distributed Datasets (RDDs). An RDD is an immutable collection of data. That means that once it's created, an RDD doesn't change. Instead, transformations that happen to an RDD create a new RDD. The most basic use of RDDs is something you may be familiar with: MapReduce.
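As a rough illustration (everything here is an assumption for demonstration, not code from the article), a couple of RDD transformations in PySpark look like this; each transformation returns a new RDD rather than modifying the old one, and nothing is actually computed until an action such as count() runs:

from pyspark import SparkContext

# "local[*]" runs Spark on all local cores; on a real cluster you would
# point this at the cluster manager instead.
sc = SparkContext("local[*]", "rdd-demo")

# Build an RDD from a local list; in practice you would load one from
# distributed storage, e.g., sc.textFile("hdfs://...").
numbers = sc.parallelize(range(1, 1001))

# Each transformation produces a *new* RDD; the original never changes.
squares = numbers.map(lambda n: n * n)
evens = squares.filter(lambda n: n % 2 == 0)

# Only an action (count, collect, ...) triggers the actual computation.
print(evens.count())

sc.stop()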
Q: Ah, yes. I've heard of that. I can't remember what it is though!
A: Very simply, MapReduce is another paradigm for processing Big Data. It goes through every item in your dataset and runs a function on it (this is the map), and then it combines all these results into a single output (this is the reduce). For example, if you had a lot of images and you wanted to balance their brightness and then create a montage of them, the map stage would go through each image in turn and balance its brightness, and the reduce stage would bring the balanced images together into a montage.
Spark is heavily inspired by MapReduce and perhaps can be thought of as a way to expand the MapReduce concept to include more features. In the above example, there would be three RDDs. The first would be your source images, the second would be created by a transform that balances the brightness of your images, and the third would be created by the transform that brings them all together to make the montage.
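To make that three-RDD shape concrete, here is a hedged PySpark sketch that uses plain numbers instead of images (so it needs no image libraries); the map step stands in for balancing the brightness and the reduce step for assembling the montage:

from pyspark import SparkContext

sc = SparkContext("local[*]", "map-reduce-demo")

# RDD 1: the source data (stand-in for the source images).
source = sc.parallelize([3, 1, 4, 1, 5, 9, 2, 6])

# RDD 2: the map step, one function applied to every item
# (stand-in for balancing the brightness of each image).
adjusted = source.map(lambda x: x * 10)

# Reduce step: combine all the results into a single output
# (stand-in for assembling the montage; in Spark terms, reduce is an
# action that collapses the RDD into one result).
total = adjusted.reduce(lambda a, b: a + b)

print(total)
sc.stop()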
Q: Ah, OK. How does this work with artificial intelligence again?
A: Well, there are two answers to that. The first is complex and doesn't fit in a two-page FAQ; the second is: "Don't worry about that, someone's done it for you."
Q: Ah, there's a library to use?
A: To call Spark's MLlib a machine learning library seems to undersell it, but because that's what its actual name stands for, we can't really argue with it. Essentially, it is a framework that comes complete with all the common machine learning algorithms ready to go. Although it does require you to do some programming, 10 lines of simple code are enough to train and run a machine learning model on a vast array of data split over many machines (Figure 2). It really does make massively parallel machine learning possible for almost anyone with a little programming experience and a few machines.
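Figure 2 isn't reproduced here, but a short MLlib program along those lines might look roughly like the sketch below; the HDFS path and the column names (f1, f2, f3, and label) are invented placeholders, not anything from the article:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("mllib-demo").getOrCreate()

# Load labelled data that already lives in the cluster's shared storage.
data = spark.read.csv("hdfs:///data/labelled.csv", header=True, inferSchema=True)

# MLlib expects the features packed into a single vector column.
assembler = VectorAssembler(inputCols=["f1", "f2", "f3"], outputCol="features")
prepared = assembler.transform(data).withColumn("label", col("label").cast("double"))

# Train a logistic regression model and apply it back to the data.
model = LogisticRegression(labelCol="label", featuresCol="features").fit(prepared)
model.transform(prepared).select("label", "prediction").show()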

Q: OK, so that's machine learning. Are there any other tasks that Spark really excels at?
A: There's nothing else that Spark makes quite as easy as machine learning, but one other area that is gaining popularity is stream data processing. This is where you have a constant flow of data coming in and you want to process it as soon as it arrives (and typically send it on to some time-series database). The Spark Structured Streaming framework makes it easy to perform just this sort of processing.
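To give a flavour of that, below is a minimal Structured Streaming sketch that counts words arriving on a local network socket; the host and port are arbitrary placeholders, and in real use the console sink would be swapped for a database or other downstream system:

from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split

spark = SparkSession.builder.appName("stream-demo").getOrCreate()

# Treat the incoming stream as an unbounded table with one "value" column.
lines = (spark.readStream
         .format("socket")
         .option("host", "localhost")
         .option("port", 9999)
         .load())

# Split each line into words and keep a running count per word.
words = lines.select(explode(split(lines.value, " ")).alias("word"))
counts = words.groupBy("word").count()

# Print the updated counts to the console as new data arrives.
query = counts.writeStream.outputMode("complete").format("console").start()
query.awaitTermination()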
It's wrong to think of Spark as just for machine learning and streaming, though. These are just two areas that happen to have frameworks on Spark. The tool itself is general-purpose and can be useful for almost any distributed computing.
Q: So, all I need to create my own streaming machine learning system is a couple of machines, a Linux distro, and Spark?
A: Not quite. You also need some way of sharing the data amongst the machines. The Hadoop Distributed File System (HDFS) is the most popular way of doing this. It lets you store more data than any single machine could hold and process it efficiently without moving it between computers more than necessary. HDFS also provides resilience in case one or more machines break. After all, the more machines you've got, the higher the chance that one breaks, and you don't want your cluster to go down just because one machine has some issues.
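As a final illustrative sketch (the namenode address and path are placeholders for your own cluster), Spark reads from HDFS directly, and each worker processes the blocks stored closest to it:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hdfs-demo").getOrCreate()

# Spark talks to HDFS natively; the data stays distributed and each
# executor works on the blocks it can reach most cheaply.
logs = spark.read.text("hdfs://namenode:9000/data/logs/*.log")
print(logs.count())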
Q: Right. I'm off to build a business empire built on computers that are more intelligent than I am.
A: Good luck.