Understanding data stream processing
Watermarks
Watermarks offer a compromise between completeness and fast response. A watermark (T) carries a timestamp T and flows along with the other messages in the data stream. The watermark indicates that, at the time of its reception, (probably) no further messages with an event time at or before T remain in the data stream. The stream processor can then assume that it has seen everything up to T and respond accordingly. If an event with a timestamp <= T appears later, it is by definition classified as a late event. A late event can be used to complete the computations and output an updated result, it can be written to a side channel for error analysis, or it can be ignored completely.
Watermarks are typically based on simple heuristics that bound the degree of disorder in the data stream. For example: If you see an event with an event time of X, you have probably seen all the events up to event time X - 4 and can emit a corresponding watermark. Applied to the Star Wars release order, this scheme means that after seeing Episode VII you could only assume you had seen everything up to Episode III. Episodes I and II were already late events, because after Episode VI a watermark (II) could have been emitted. With such a high degree of disorder, you would have to wait a very long time to be sure that the story was fully known.
Turning to the example of Alice, I would have to delay the join of Play(Spaceballs)
with the matching impressions until I have received, from all input data streams, a watermark greater than or equal to the timestamp of the playback event. This is the only way to honor the watermark contract in operators that read from several data streams. From the operator's point of view, you can picture this as an event-time clock whose hand only advances to the minimum of the last watermark seen on each input data stream. Whenever this hand moves, all time-based actions up to that time can be processed.
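The event-time clock described above can be sketched as follows. The class and input names are illustrative assumptions, not a framework API.

```python
# A sketch of an operator's event-time clock: the clock only advances
# to the minimum of the last watermark seen per input stream.

class EventTimeClock:
    def __init__(self, inputs):
        # No watermark seen yet on any input.
        self.watermarks = {name: float("-inf") for name in inputs}

    def on_watermark(self, input_name, timestamp):
        """Record the latest watermark per input; the clock reads the
        minimum across all inputs, so it waits for the slowest stream."""
        self.watermarks[input_name] = max(
            self.watermarks[input_name], timestamp)
        return min(self.watermarks.values())


clock = EventTimeClock(["impressions", "playbacks"])
clock.on_watermark("impressions", 12)    # clock still at -inf
t = clock.on_watermark("playbacks", 10)  # clock advances to 10
```

Only when both inputs have reported a watermark does the clock move, which is what forces the join to wait for all streams.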
Data Storage
To allow for more complex applications, a stream processing framework needs some form of persistent data storage. Storage is either outsourced to an external component, such as a centralized database, or handled internally as local storage. Separating the processing of data from storage offers some advantages, but it can also cause complications, such as higher latency. Local storage is therefore an important component of stream processing; it keeps the state of an operator in RAM or, if necessary, spills it to hard disk/SSD. Local storage is typically faster, and it allows the framework to provide its own consistency models in a relatively simple and resource-efficient way. The price of local storage is that the state of an operator now has to be managed by the framework, which also has to take care of fault tolerance, reliability, scalability, and expandability.
In the case of Alice's video data, we need a (local) data store to buffer the impression and playback events within the join operator until the matching watermark signals that the input is complete, or complete enough, to process. This ensures that the output includes both Star Wars and Scary Movie as the ads shown for the Spaceballs playback and that personalization is handled correctly.
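A watermark-driven join with local buffering might look like the following sketch. The event names and the join condition (impressions that precede the playback) are assumptions made for illustration, not the article's actual implementation.

```python
# A sketch of a watermark-driven stream join, loosely following the
# Alice example: buffer events locally, emit joins once the watermark
# says all matching impressions must have arrived.

class BufferedJoin:
    def __init__(self):
        self.impressions = []   # buffered (timestamp, ad) impressions
        self.playbacks = []     # buffered (timestamp, title) playbacks

    def on_impression(self, ts, ad):
        self.impressions.append((ts, ad))

    def on_playback(self, ts, title):
        self.playbacks.append((ts, title))

    def on_watermark(self, watermark):
        """Emit joins for playbacks covered by the watermark; keep
        later playbacks buffered until a higher watermark arrives."""
        ready = [p for p in self.playbacks if p[0] <= watermark]
        self.playbacks = [p for p in self.playbacks if p[0] > watermark]
        return [(title, ad) for (pts, title) in ready
                for (its, ad) in self.impressions if its <= pts]


join = BufferedJoin()
join.on_impression(1, "Star Wars")
join.on_impression(2, "Scary Movie")
join.on_playback(3, "Spaceballs")
print(join.on_watermark(3))
# -> [('Spaceballs', 'Star Wars'), ('Spaceballs', 'Scary Movie')]
```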
Reliability and Consistency Models
Stream processing usually handles one incoming event after another (although in principle it is possible to combine several events into groups or batches and process them en bloc). It is important to establish a consistency model around these events. The model defines how an event, or the message carrying it, is handled if an error occurs. If there are no guarantees, the model is known as "At Most Once," which means that a message may or may not be processed. "At Least Once" means that the system ensures that each message is processed at least once, and "Exactly Once" means the message is processed exactly once.
Ultimately, it doesn't matter how often a message is processed, as long as it only affects the internal state as defined by these guarantees and has no side effects. "Exactly Once," in this case, means that each message influences the internal state exactly once. The message may well be processed multiple times, but the state, for example a counter, remains consistent and never counts an event twice. Side effects can also be covered separately, for example at the sinks with "Exactly Once End-to-End," where the results are also produced exactly once.
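The distinction between processing a message once and affecting the state once can be sketched with a deduplicating counter. This is an illustrative example under the assumption that every message carries a unique ID; it is not how any specific framework implements "Exactly Once."

```python
# A sketch of "Exactly Once" in the effect-on-state sense: messages may
# be redelivered (At Least Once), but the counter's state is updated at
# most once per message ID.

class ExactlyOnceCounter:
    def __init__(self):
        self.count = 0
        self.seen = set()   # IDs of messages already applied

    def process(self, message_id):
        """Apply the message to the state at most once, no matter how
        often it is redelivered."""
        if message_id in self.seen:
            return          # duplicate delivery: no effect on state
        self.seen.add(message_id)
        self.count += 1


counter = ExactlyOnceCounter()
for msg in ["a", "b", "a", "a", "c"]:   # "a" is redelivered twice
    counter.process(msg)
print(counter.count)  # -> 3
```

In a real system, the `seen` set would itself have to live in the fault-tolerant operator state, so that it survives a restart together with the counter.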
These forms of fault tolerance require two things: the ability to create a distributed, consistent snapshot of the entire running system, and the ability to restore this state on restart while resuming reads from the sources at the exact position where the snapshot was taken.
The different stream processing frameworks use different approaches to ensure that these guarantees are met. You'll find comprehensive descriptions of the fault tolerance approaches in the framework documentation.