Understanding data stream processing
Watermarks
Watermarks offer a compromise between completeness and fast response. A watermark (T) contains a timestamp T and flows along with the other messages in the data stream. The watermark indicates that, at the time of its arrival, (probably) no further messages with an event time at or before T remain in the data stream. The stream processor can then assume that it has seen everything up to T and respond accordingly. If an event with a timestamp <= T does appear later, it is by definition classified as a delayed event. A delayed event can be used to complete the computations and output an updated result, it can be written to a side channel for error analysis, or it can be ignored completely.
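This classification step can be sketched in a few lines of Python. The names (`Event`, `handle`, the channel lists) are illustrative, not taken from any particular framework:

```python
# Sketch: routing an event against the last-seen watermark.
# Events at or before the watermark count as delayed and go to a side channel.
from dataclasses import dataclass

@dataclass
class Event:
    timestamp: int   # event time
    payload: str

def handle(event, watermark, on_time, late_channel):
    """Process an on-time event or divert a delayed one for error analysis."""
    if event.timestamp <= watermark:
        late_channel.append(event)   # delayed event by the definition above
    else:
        on_time.append(event)

on_time, late = [], []
handle(Event(5, "a"), watermark=3, on_time=on_time, late_channel=late)
handle(Event(2, "b"), watermark=3, on_time=on_time, late_channel=late)
# the event with timestamp 2 lands in the late channel
```

A real framework would make this decision per operator and would offer the three options (reprocess, side output, drop) as configuration.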
Watermarks are typically based on simple heuristics that bound the degree of disorder in the data stream. For example: If you see an event with an event time of X, you have probably seen all the events up to event time X - 4 and can emit a corresponding watermark. Applied to the Star Wars saga (episode numbers as event times, arriving in release order), this scheme means that only once Episode VII arrives can you assume you have seen everything up to Episode III. Episodes I and II were already late events, because after Episode VI a watermark (II) could have been created. With such a high degree of disorder, you would have to wait a very long time to be sure that the story was fully known.
Turning to the example of Alice, I would have to delay the join of Play(Spaceballs) with the matching impressions until I receive a watermark from every input data stream that is greater than or equal to the timestamp of the playback event. This is the only way to honor the watermark contract in operators that read from several data streams. From the operator's point of view, you can picture this as an internal event time clock whose hand only advances to the minimum of the last watermark seen per input data stream. Whenever the hand moves, all time-based actions up to that time can be processed.
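The minimum-of-all-inputs clock can be sketched like this; the class name and method are made up for the illustration:

```python
# Operator-local event time clock: it only advances to the minimum of the
# last watermark seen on each input data stream.
class EventTimeClock:
    def __init__(self, num_inputs):
        self.last_wm = [float("-inf")] * num_inputs

    def advance(self, input_idx, watermark):
        """Record a watermark from one input; return the operator's event time."""
        self.last_wm[input_idx] = max(self.last_wm[input_idx], watermark)
        return min(self.last_wm)

clock = EventTimeClock(2)
clock.advance(0, 10)   # input 1 has not reported yet: clock stays at -inf
clock.advance(1, 7)    # clock moves to 7
clock.advance(1, 12)   # clock moves to 10, still bounded by input 0
```

The slowest input therefore determines how far the operator's time-based actions may progress.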
Data Storage
To allow for more complex applications, a stream processing framework needs some form of persistent data storage. State is often outsourced to an external component, such as a centralized database; the alternative is internal, local storage. Separating the processing of data from its storage offers some advantages, but it can also cause complications, such as higher latency. Local storage is therefore an important component of stream processing; it keeps the state of an operator in RAM or, if necessary, spills it to hard disk or SSD. Local storage is typically faster, and it allows the framework to provide its own consistency models in a relatively simple and resource-efficient way. The price of local storage is that the state of an operator now has to be managed by the framework, which also has to take care of fault tolerance, reliability, and scalability.
In the case of Alice's video data, we need a (local) data store to buffer the display and playback events within the Join operator until the matching watermark signals that the input is complete, or complete enough, to process. This ensures that the output lists both Star Wars and Scary Movie as ads for the Spaceballs playback and that personalization is handled correctly.
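A minimal sketch of such a buffering join might look as follows. All names (`BufferedJoin`, the callback methods, the tuple layout) are invented for the example; a real operator would keep this state in the framework's managed store:

```python
# Buffering join: impressions and plays are held in local state until the
# watermark passes a play event's timestamp, i.e. the input is complete enough.
from collections import defaultdict

class BufferedJoin:
    def __init__(self):
        self.impressions = defaultdict(list)   # user -> [(ts, ad)]
        self.plays = []                        # buffered (ts, user, title)

    def on_impression(self, ts, user, ad):
        self.impressions[user].append((ts, ad))

    def on_play(self, ts, user, title):
        self.plays.append((ts, user, title))

    def on_watermark(self, wm):
        """Emit join results for all plays the watermark has passed."""
        ready = [p for p in self.plays if p[0] <= wm]
        self.plays = [p for p in self.plays if p[0] > wm]
        return [(user, title,
                 [ad for (its, ad) in self.impressions[user] if its <= ts])
                for (ts, user, title) in ready]

j = BufferedJoin()
j.on_impression(1, "alice", "Star Wars")
j.on_impression(2, "alice", "Scary Movie")
j.on_play(3, "alice", "Spaceballs")
assert j.on_watermark(2) == []   # input not yet complete: keep buffering
result = j.on_watermark(3)       # watermark reaches the play event
print(result)
# [('alice', 'Spaceballs', ['Star Wars', 'Scary Movie'])]
```

Only once the watermark reaches the playback timestamp does the join fire, with both ads present in the output.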
Reliability and Consistency Models
Stream processing usually processes one incoming event after another (although in principle it is possible to combine several events into groups or batches and process them en bloc). It is important to establish a consistency model around these events. The model considers how the event or its message is processed in case of an error. If there are no guarantees, the model is known as "At Most Once," which means that a message may or may not be processed. "At Least Once" means that the system ensures that each message is processed at least once, and "Exactly Once" means the message is processed exactly once.
Ultimately, it doesn't matter how often a message is processed, as long as it only affects the internal state as defined by these guarantees and has no side effects. "Exactly Once," in this case, means that each message influences the internal state exactly once. The message may be processed multiple times, but the state, for example a counter, always remains consistent and never counts an event twice. Side effects can be covered separately, for example, at the sinks with "Exactly Once End-to-End," where the results are also produced exactly once.
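One common way to get this effect on top of at-least-once delivery is to make the state update idempotent, for instance by recording which message IDs have already been applied. This is a simplified sketch, not the mechanism of any specific framework:

```python
# A counter whose state each message influences exactly once, even when the
# transport redelivers messages (at-least-once). The set of applied message
# IDs is part of the state and would be checkpointed along with the count.
class DedupCounter:
    def __init__(self):
        self.count = 0
        self.seen = set()   # IDs of messages already applied

    def apply(self, msg_id):
        if msg_id in self.seen:
            return self.count          # duplicate delivery: state untouched
        self.seen.add(msg_id)
        self.count += 1
        return self.count

c = DedupCounter()
for msg in ["m1", "m2", "m1", "m3", "m2"]:   # "m1" and "m2" are redelivered
    c.apply(msg)
print(c.count)   # 3: each distinct message counted exactly once
```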
These forms of fault tolerance require two things: the ability to create a distributed, consistent snapshot of the entire running system, and the ability to restore this state on restart while resuming reading from the sources at the exact position where the snapshot was taken.
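Reduced to a single operator and a single source, the two requirements can be sketched as follows. The function names and the offset-based source are assumptions for the illustration, not a description of any framework's checkpoint protocol:

```python
# Checkpointing sketch: snapshot operator state together with the source
# offset, then restore both so processing resumes exactly where it left off.
import copy

def snapshot(state, source_offset):
    return {"state": copy.deepcopy(state), "offset": source_offset}

def restore(snap):
    return copy.deepcopy(snap["state"]), snap["offset"]

def run(events, state, offset):
    """Process the source from the given offset; return state and new offset."""
    for i in range(offset, len(events)):
        state["count"] = state.get("count", 0) + events[i]
    return state, len(events)

events = [1, 2, 3, 4]
state, off = run(events[:2], {}, 0)   # process the first two events
snap = snapshot(state, off)           # checkpoint: state plus source position
state, off = restore(snap)            # crash and recovery
state, off = run(events, state, off)  # resume from the saved offset
print(state["count"])   # 10: no event lost, none counted twice
```

Because state and offset are captured together, replay after a failure neither loses events nor applies any of them a second time.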
The different stream processing frameworks use different approaches to ensure that these guarantees are met. You'll find comprehensive descriptions of the fault tolerance approaches in the framework documentation.