Title: Improving the accuracy of continuous aggregates and mining queries on data streams under load shedding

Authors: Yan-Nei Law, Carlo Zaniolo

Addresses: Bioinformatics Institute, 30 Biopolis Street, #07-01, Matrix, 138671, Singapore; UCLA Computer Science Department, 4732 Boelter Hall, Los Angeles, CA 90095, USA

Abstract: Random samples are common in data stream applications due to limitations in data sources and transmission lines, or to load-shedding policies. Here we introduce a formal error model and show that, besides providing accurate estimates, it improves query answer accuracy by exploiting past statistics. The method is general, robust in the presence of concept drift, and minimises uncertainties due to sampling with negligible time and space overhead. We describe the application of the method and the results obtained for SQL window aggregates, statistical aggregates such as quantiles, and data mining functions such as k-means clustering and naive Bayesian classifiers.

Keywords: load shedding; data streams; query processing; sampling; continuous aggregates; error modelling; query answer accuracy; statistics; data mining.

DOI: 10.1504/IJBIDM.2008.017978

International Journal of Business Intelligence and Data Mining, 2008 Vol.3 No.1, pp.99 - 117

Available online: 25 Apr 2008
