Punctuated Data Streams

Abstract

As most current query processing architectures are already pipelined, it seems logical to try to extend them from stored files to data streams. However, two classes of query operators are impractical for processing long or infinite data streams. Unbounded stateful operators (such as join) maintain state with no upper bound, and therefore may run out of memory. Blocking operators (such as sort) read the entire input before emitting an output, and therefore might never produce a result. We believe that a priori knowledge of a data stream can permit the use of such operators. This knowledge can be expressed in the form of punctuations.

Punctuations in a stream mark the end of substreams, allowing us to view an infinite stream as a combination of finite streams. We introduce three kinds of invariants to specify the proper behavior of query operators in the presence of punctuation. Pass invariants unblock blocking operators by defining when a such an operator can pass results on. Keep Invariants decrease the size of state by defining when a stateful operator can discard unneeded state. Propagation invariants define when and how an operator can pass punctuation on, so later operators can take advantage of them.


Basic Idea

In the animation below, a stream generator outputs data items with similar structure. The output consists of regular data (colored boxes that do not contain a 'P'), and punctuations (colored boxes that contain a 'P'). A stream processor receives the data, and determines what should be output. In this example, the stream processor must wait until it has seen all the data before it can output a result. Punctuations are embedded in the output stream as well. Each punctuation is a pattern describing a subset of the data items. In this case, the color of the punctuation matches the color of the data items it matches.

Notice that the stream processor caches each data item as it arrives. Later, when a punctuation arrives, the stream processor outputs its results based on data items that match the punctuation (have the same color). Also, those data items are purged from the processor's cache. These are the two benefits of punctuated data streams.

The semantics of a punctuated stream are that there are no tuples in the stream that match the punctuation. This is so the stream processor can be certain it can output results based on a punctuation. Note in the stream above that there are no data items following a punctuation that are of the same color.


Functions, Punctuation Schemes, and Invariants

Functions - Functions commonly used when processing punctuations.
Punctuation Format - The format we use to designate punctuations in XML.
Pass Invariants - define when a blocking operator can correctly pass a subset of its result before all data has been received.
Propagation Invariants - define when an oerator can correctly pass punctuation.
Keep Invariants - define what data items an unbounded stateful operator must keep in state.

Links

Real World Example using a warehouse containing temperature sensitive merchandise
Resources used for this research so far.
My Ph.D. Thesis (PDF)on Punctuated Streams

Code to Download

The following code models stream processors acting on punctuated data streams using Haskell. If you use Hugs, you must use the -98 option, since the model uses multiple parameter classes.
Operators punctuated stream processing model
Testing testing harness for the model, including code to simulate the real world example

Credits

This work is part of a joint research project called Niagara between Portland State University, University of Wisconsin, and Whitworth University.

Last modified by Pete Tucker on 26 August 2005.