Understanding data stream processing

All Is Flux

Article from Issue 244/2021

Batch processing strategies won't help if you need to process large volumes of incoming data in real time. Stream processing is a promising alternative to conventional batch techniques.

Stream processing, also known as data stream processing, has been around since the early 1970s, but it has seen a big resurgence of interest in recent years. To understand why stream processing is on the rise, first consider how a conventional program processes data. Traditional software reads a chunk of data all at once and then performs operations on it. This batch technique is fine for certain types of problems, but in other use cases, it is quite limiting – especially in the modern era of parallel processing and big data.

Stream processing instead envisions the data as a continuous flow. New events are processed as they occur. You can envision the program as something like a factory assembly line – a stream of incoming data is analyzed, manipulated, and transformed as it passes through the system. In some cases, parallel streams might arrive separately for the program to analyze, process, and merge together.

Stream processing excels at use cases that require real-time processing of incoming data from large datasets, such as fraud detection software for a credit card company or a program that manages and interprets data from IoT environmental sensors.

A stream processing program consists of sources, nodes/operators, and sinks that are connected by data streams. Sources either read data from external components or generate data themselves. Sinks are responsible for outputting data – to the screen, to files, or to external systems. Operators have at least one input to which a data stream is connected.

One popular form of stream processing known as distributed stateful stream processing opens up almost infinite possibilities for developing complex business logic, analysis processes, and even complete applications. Several open source frameworks offer support for distributed stateful stream processing, which means developers can explore the possibilities of this powerful technique without worrying about underlying network layers or even process synchronization. Adopting an existing framework also lets you take advantage of built-in fault tolerance.

Stream processing Terms

To understand stateful stream processing, imagine the following scenario: A major video-streaming provider has thousands of movies on offer. It is almost impossible for the user to browse through them all to find the right movie that suits the user's preferences and current mood. Therefore, the provider wants to send each user a few personalized suggestions. After the user has browsed the offering for several weeks, the streaming provider can anticipate which movies or series the customer likes. The provider looks at all the suggestions it has made, how the customer continues to browse these offers, and which video the customer watches. This information is available to the provider in two separate data streams: Impressions (suggestions shown) and Plays (video playback). If you look more closely at the data streams for this task, you can derive an architecture like the one shown in Figure 1, which reads the streams, performs further processing, then finally merges the streams to achieve the desired result, which is new recommendations for the reader based on the analysis of past behavior.

Figure 1: Logical data flowchart for a stream-processing application that analyzes and determines the displayed video suggestions (Impressions) that guide a user to a video playback decision (Plays).

Data Stream

A data stream is basically a sequence of data or events. Some definitions distinguish between binary data streams, which occur in music or video data processing, and event data streams, which are discussed in this article. An event in this context is a request that can be considered separately from all others. Event data streams are the model used by typical stream processing frameworks.

A network channel is a data stream that can be read block by block or byte by byte. You could also define a data stream for a network channel in terms of its logical structure – for example, into individual HTTP requests.

A closer look at event data streams reveals some additional details. First of all, each data stream is naturally unlimited: If you read at a certain position in the imaginary data stream, you do not know how many more events exist. If the data stream is cached in some way, there is often the possibility to shift this reading position and thus jump back into the past. But what lies in the future is unknown at the current time. To a certain extent, delimited data streams – data streams with a defined beginning and end – constitute a special case of this general view. Such data streams are typically created when someone artificially splits up an unlimited data stream to process events en bloc. An example could be the logfiles from one day or all the entries for one day. Working with limited data streams is typically referred to as batch processing.

In the video-streaming scenario described earlier in this article, two source data streams have to be processed: Impressions and Plays. The two are completely independent of each other, and they are separately populated with events by different apps or web services. The events can take different routes on their way to the sources of the application in external and probably also distributed systems (e.g., through load balancing or server failure). They thus arrive with a time delay – an effect that must also be taken into account within a distributed application.

Figure 2 shows a more complex physical execution plan for the application shown in Figure 1. The figure shows two parallel instances of sources and operators – and one instance as the data sink. Different connection patterns link these instances, including simple forwarding from one source to a downstream operator or more complex partitioning patterns between Map and Join operators. Partitioning the data stream is necessary to assign all display events associated with a user to playback events. If we can ensure that all events are always processed by a user on the same Join instance, you can easily parallelize the computational work.

Figure 2: Physical execution plan for the stream-processing application in Figure 1 with two parallel instances of each source and operator, but only one sink.


The assignment of display events to playback events should ideally reconstruct all display events before the time of playback exactly as the user has seen them on their end device. But that's easier said than done: An initial simple solution could be for all display events to be cached in the Join and whenever playback occurs, for all associated display events to be searched for and then output. Although this sounds logical, it unfortunately leads to wrong results in the general case.

Imagine the following: Alice has seen movie recommendations for Star Wars and Scary Movie and finally watched Spaceballs. The associated events, which have come from Alice's devices and have migrated through several different layers of the provider's servers, are then read from the sources and processed in the map operators (for example, removing superfluous details). The events are now located in the data streams between Map and Join (Figure 2). The order in which these three events are now read by the Join operator has decisive consequences for the computation results. If Spaceballs is viewed first, there are no previous display events at all. Similarly, only one ad might arrive before playback, and the result would not be complete. This solution does not reconstruct all display events before the time of playback as Alice saw them but only links the display events that occurred before the playback event.

The correct solution must reference the timestamps in the events. This is known as event time processing and is an essential idea in any stream processing framework. The Star Wars analogy is also used in this context to illustrate the differences between these two time concepts and to emphasize that incoming events do not necessarily have to be sorted by event time (see the box entitled "A Galaxy Far Away: Processing Time vs. Event Time").

A Galaxy Far Away: Processing Time vs. Event Time

The processing time is the point in time at which George Lucas and his team processed the events "a long time ago in a galaxy far, far away" to create a movie (i.e., since 1977). The event time, on the other hand, is the time in which the story takes place (Episode I to VI and beyond). Depending on your point of view, one or the other is the right order:

| Sorted by Processing Time: | IV (1977) - V (1980) - VI (1983) - I (1999) - II (2002) - III (2005) || Sorted by Event Time: | I (1999) - II (2002) - III (2005) - IV (1977) - V (1980) - VI (1983) |

The difficult thing about event time processing is to find out when the incoming message stream is complete enough to make statements about it. In the Star Wars context, this can be illustrated by the numbers for the years. For example, if you watched the movies in the order of the calendar years, when can you be sure that you know the whole story? In 1983, for example, it was assumed that the story was complete, but with some delay, further events from before the previously known story were revealed. What is the outlook in 2005? Is the story complete following Episodes I to VI? Do we maybe know this when Episode VII (2015) is released? In this case the answer would unfortunately be no, because in 2016 the movie Rogue One was released, and it is set between Episodes III and IV. Things are similar for a stream processor, because every incoming event poses the question as to whether it has to wait for more data (in which case its knowledge would be more comprehensive) or should it respond (promptly) to the currently available information?

Buy this article as PDF

Express-Checkout as PDF
Price $2.95
(incl. VAT)

Buy Linux Magazine

Get it on Google Play

US / Canada

Get it on Google Play

UK / Australia

Related content

  • Apache StreamPipes

    You don't need to be a stream processing expert to create useful custom solutions with Apache StreamPipes. We'll use StreamPipes to build a simple app that calculates when the International Space Station will fly overhead.

  • IBM System S: Stream Computing

    IBM's new System S analyzes business data at their creation.

  • FAQ – Apache Spark

    Spread your processing load across hundreds of machines as easily as running it locally.

  • ApacheCon 2009 Free Live Stream

    The Apache Software Foundation (ASF) is holding ApacheCon US 2009 from November 2-6 in Oakland, California. The foundation for a free webserver is also celebrating its 10th birthday. In honor of this 10th birthday, the celebration includes three days of the conference program available as a FREE Live stream.

  • ApacheCon Europe 2009 -- Live Stream for Free

    From March 25 to March 27 the official conference of the Apache Software Foundation (ASF), ApacheCon Europe 2009 will be held in Amsterdam, Netherlands. Some of the talks will be available here on Linux Magazine for free.

comments powered by Disqus

Direct Download

Read full article as PDF:

Price $2.95