Punctuated Data Streams - Concrete Example
Example
Consider a warehouse
of temperature-sensitive items. Sensors are
scattered throughout the warehouse, sending
reports of temperature data at various intervals
to a central system, which derives the maximum
temperature reported at any sensor each hour.
We can write a query that unions the items from
all sensors, groups that result by the hour, and
determine the maximum temperature for each hour.
Unfortunately, a standard group-by operator
waits until it reads the final data item to
emit any results. Since the data streams have
no pre-determined end, the group by operator
will never produce output. To deal with this
problem, punctuation is inserted into the data
stream periodically, stating that there will
no longer be data items for a given hour. The
group by operator can use these punctuations to
emit results for the hour described by the punctuation,
and discard the state that was required for those results.
In the example below, Tuples are blue, and punctuations are red.
The punctuations describe when an hour has passed.
In this example, tuples are sent through the
system every 30 minutes. They are cached by the group by
operator so that we can determine the maximum temperature for a given
hour. You can see that, when a punctuation arrives to the group by
operator, all tuples that match that punctuation are purged from
the cache. The operator is also able to output the maximum
result for that hour.
Last modified by Pete Tucker on 26 August 2005.