Understanding data stream processing

Watermarks

Watermarks offer a compromise between completeness and fast response. A watermark (T) contains a timestamp T and flows with the remaining messages in the data stream. The watermark indicates that, at the time of reception, (probably) no further messages with an event time before or including T are present in the data stream. The stream processor can then assume that it has seen everything up to T and respond accordingly. If an event with timestamp <= T appears later on, it is classified as a delayed event according to the definition. The delayed event can either be used to complete the computations and output a new result, it can be written to a side channel for error analysis, or it can be ignored completely.

Watermarks are typically based on simple mechanisms used to define the degree of disorder in the data stream. For example: If you see an event with an event time of X, you have probably seen all the events up to event time X - 4 and can create a corresponding watermark. With this scheme, you can only assume that you have seen everything up to Episode III in Star Wars VII. Episode I and II were already late events, because after episode VI a watermark (II) could have been created. With such a high degree of disorder, you would have to wait a very long time to be sure that the story was fully known.

Turning to the example of Alice, I would have to wait with the join for Play(Spaceballs) with the appropriate impressions until I receive a watermark from all input data streams that is greater or equal to the timestamp of the playback event. This is the only way to keep to the agreement on watermarks in operators that read from different data streams. From the operator's point of view, this can also be imagined as having your own event time clock, with hands that only move to reflect the minimum of all last-seen watermarks per input data stream. When this pointer moves, all time-based actions up to this time can be processed.

Data Memory

To allow for more complex applications, a stream processing framework needs some form of persistent data storage. Storage is often outsourced to an external component, such as a centralized database, or to internal or local data storage. Separating the processing of data from storage offers some advantages, but it can also cause complications, such as higher latency. Local storage is therefore an important component of stream processing; it keeps the state of an operator in RAM or, if necessary, stores it on hard disk/SSD. Local storage is typically faster, and it allows the framework to provide its own consistency models in a relatively easy and resource-efficient way. The price of local storage is that the state of an operator now has to be managed by the framework. The framework also has to take care of fault tolerance, reliability, scalability, and expandability.

In the case of Alice's video data, we need a (local) data store to buffer the display and play back events within the Join operator until the matching watermark signals that the input is complete or complete enough to process. This ensures that the output includes both Star Wars and Scary Movie as ads for the Spaceballs playback and that personalization is handled correctly.

Reliability and Consistency Models

Stream processing usually processes one incoming event after another (although in principle it is possible to combine several events into groups or batches and process them en bloc). It is important to establish a consistency model around these events. The model considers how the event or its message is processed in case of an error. If there are no guarantees, the model is known as "At Most Once," which means that a message may or may not be processed. "At Least Once" means that the system ensures that each message is processed at least once, and "Exactly Once" means the message is processed exactly once.

Ultimately, it doesn't matter how often a message is processed as long as it only affects the internal states as defined by these guarantees and has no side effects. "Exactly Once," in this case, means that each message influences the internal state exactly once only. It may happen that the message is processed multiple times, but the state, for example a counter, is always consistent and does not count an event twice! Side effects can also be included separately, for example, at the sinks with "Exactly Once End-to-End," where the results are also produced exactly once.

These forms of fault tolerance need two things: the ability to create a distributed, consistent snapshot of the entire running system and the ability to restore this state on reboot while continuing reading from the sources at the exact position where the snapshot was taken.

The different stream processing frameworks use different approaches to ensure that these guarantees are met. You'll find comprehensive descriptions of the fault tolerance approaches in the framework documentation.

Buy this article as PDF

Express-Checkout as PDF
Price $2.95
(incl. VAT)

Buy Linux Magazine

SINGLE ISSUES
 
SUBSCRIPTIONS
 
TABLET & SMARTPHONE APPS
Get it on Google Play

US / Canada

Get it on Google Play

UK / Australia

Related content

  • Apache StreamPipes

    You don't need to be a stream processing expert to create useful custom solutions with Apache StreamPipes. We'll use StreamPipes to build a simple app that calculates when the International Space Station will fly overhead.

  • IBM System S: Stream Computing

    IBM's new System S analyzes business data at their creation.

  • FAQ – Apache Spark

    Spread your processing load across hundreds of machines as easily as running it locally.

  • ApacheCon 2009 Free Live Stream

    The Apache Software Foundation (ASF) is holding ApacheCon US 2009 from November 2-6 in Oakland, California. The foundation for a free webserver is also celebrating its 10th birthday. In honor of this 10th birthday, the celebration includes three days of the conference program available as a FREE Live stream.

  • DCCP

    The DCCP protocol gives multimedia developers a powerful alternative to TCP and UDP.

comments powered by Disqus

Direct Download

Read full article as PDF:

Price $2.95

News