Title: Paving the way for next generation data-stream clustering: towards a unique and statistically valid cluster structure at any time step

Authors: Pascal Cuxac, Alain Lelu, Martine Cadot

Addresses: INIST-CNRS, 2 allee du Parc de Brabois, CS 10310, 54 519-Vandoeuvre-les-Nancy Cedex, France. ' LASELDI/Universite de Franche-Comte, 30 rue Megevand, 25030 Besancon Cedex, France; LORIA Campus Scientifique, BP 239-54506 Vandoeuvre-les-Nancy Cedex, France. ' LORIA Campus Scientifique, BP 239-54506 Vandoeuvre-les-Nancy Cedex, France; UHP Campus Scientifique, BP 239-54506 Vandoeuvre-les-Nancy Cedex, France

Abstract: In data-stream clustering, with dynamic text mining as our application domain, we pursue a two-fold, long-term goal: 1) at each data input, the resulting cluster structure has to be unique, i.e., independent of the order in which the input vectors are presented; 2) this structure has to be meaningful to an expert, e.g., not composed of one huge 'catch-all' cluster surrounded by a cloud of tiny specific ones, as is often the case with large sparse data tables. The first condition is satisfied by our Germen density-mode seeking algorithm, but the relevance of the clusters to expert judgment depends on the definition of data density, which in turn depends on the type of graph chosen for embedding the similarities between text inputs. Having already demonstrated the dynamic behaviour of the Germen algorithm, we focus here on adding a Monte-Carlo method for extracting statistically valid inter-text links, which looks promising when applied both to an excerpt of the Pascal bibliographic database and to the Reuters-RCV1 news test collection. Though not a central issue here, the time complexity of our algorithms is discussed at the end.
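
As an illustration of what a Monte-Carlo test for inter-text links can look like, the sketch below keeps a link between two texts only if they share more terms than expected under randomised versions of the text-by-term table. It is a generic margin-preserving swap randomisation, not necessarily the procedure used in the article; the function names (`swap_randomise`, `valid_links`), the shared-term count as link statistic, and the parameters `n_sim`, `n_swaps` and `alpha` are all assumptions made for this example.

```python
import random


def swap_randomise(rows, n_swaps, rng):
    """Return a randomised copy of a binary text-by-term table.

    Each text is a collection of terms. Every successful swap exchanges one
    term between two texts, so the number of terms per text and the number
    of texts per term are both preserved.
    """
    docs = [set(r) for r in rows]
    n = len(docs)
    for _ in range(n_swaps):
        a, b = rng.sample(range(n), 2)
        # Sorted lists keep the draw reproducible for a given seed.
        only_a = sorted(docs[a] - docs[b])
        only_b = sorted(docs[b] - docs[a])
        if not only_a or not only_b:
            continue  # no exchange possible for this pair
        ta, tb = rng.choice(only_a), rng.choice(only_b)
        docs[a].remove(ta)
        docs[a].add(tb)
        docs[b].remove(tb)
        docs[b].add(ta)
    return docs


def valid_links(rows, n_sim=200, n_swaps=None, alpha=0.05, seed=0):
    """Keep the text pairs whose shared-term count is unusually high.

    Assumed statistic: number of shared terms. The empirical p-value is the
    share of randomised tables in which the pair shares at least as many
    terms as in the observed table (with the usual +1 correction).
    """
    rng = random.Random(seed)
    docs = [set(r) for r in rows]
    n = len(docs)
    if n_swaps is None:
        n_swaps = 10 * sum(len(d) for d in docs)  # arbitrary mixing budget
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    observed = {(i, j): len(docs[i] & docs[j]) for i, j in pairs}
    exceed = {p: 0 for p in pairs}
    for _ in range(n_sim):
        sim = swap_randomise(docs, n_swaps, rng)
        for i, j in pairs:
            if len(sim[i] & sim[j]) >= observed[(i, j)]:
                exceed[(i, j)] += 1
    return {
        p: observed[p]
        for p in pairs
        if observed[p] > 0 and (exceed[p] + 1) / (n_sim + 1) <= alpha
    }


if __name__ == "__main__":
    toy = [["a", "b", "c"], ["a", "b", "d"], ["e", "f"], ["e", "g"], ["c", "h"]]
    print(valid_links(toy, n_sim=100))
```

The retained pairs would form the edges of the graph on which a density could then be defined; testing all pairs at each simulation is quadratic in the number of texts, so this sketch is only practical for small samples.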

Keywords: data stream clustering; text mining; incremental algorithm; randomisation test; density mode clustering; graph validation; data mining.

DOI: 10.1504/IJDMMM.2011.042933

International Journal of Data Mining, Modelling and Management, 2011 Vol.3 No.4, pp.341 - 360

Published online: 08 Oct 2011
