Punctuated Data Streams - Concrete Example


Consider a warehouse of temperature-sensitive items. Sensors are scattered throughout the warehouse, sending reports of temperature data at various intervals to a central system, which derives the maximum temperature reported at any sensor each hour. We can write a query that unions the items from all sensors, groups that result by the hour, and determine the maximum temperature for each hour. Unfortunately, a standard group-by operator waits until it reads the final data item to emit any results. Since the data streams have no pre-determined end, the group by operator will never produce output. To deal with this problem, punctuation is inserted into the data stream periodically, stating that there will no longer be data items for a given hour. The group by operator can use these punctuations to emit results for the hour described by the punctuation, and discard the state that was required for those results. In the example below, Tuples are blue, and punctuations are red. The punctuations describe when an hour has passed.
In this example, tuples are sent through the system every 30 minutes. They are cached by the group by operator so that we can determine the maximum temperature for a given hour. You can see that, when a punctuation arrives to the group by operator, all tuples that match that punctuation are purged from the cache. The operator is also able to output the maximum result for that hour.

Last modified by Pete Tucker on 26 August 2005.