Title: CNODE: clustering of set-valued non-ordered discrete data

Authors: Sunil Kumar, Shamik Sural, Alok Watve, Sakti Pramanik

Addresses: School of Information Technology, Indian Institute of Technology, Kharagpur – 721 302, India; School of Information Technology, Indian Institute of Technology, Kharagpur – 721 302, India; Department of Computer Science and Engineering, Michigan State University, East Lansing, MI 48824, USA; Department of Computer Science and Engineering, Michigan State University, East Lansing, MI 48824, USA

Abstract: This paper introduces a clustering technique named Clustering of set-valued Non-Ordered DiscretE data (CNODE), in which each data item is a vector having a set of non-ordered discrete values per dimension. Since the usual definitions of distance, such as Euclidean and Manhattan, do not hold in a non-ordered discrete data space (NDDS), other measures such as Hamming distance are often used to define the distance between vectors with single-valued discrete dimensions. This type of distance is not meaningful for set-valued dimensions; hence, we propose a similarity measure based on set intersection for clustering set-valued vectors. We also suggest a new measure, named lines of clustroids (LOC), for determining the quality of clustering for this type of data. In contrast to other existing clustering techniques in NDDS, CNODE does not rely on any kind of pre-processing of the dataset. Experiments with synthetic and real datasets show that CNODE is robust to data variations, scalable to large datasets and efficient for high dimensions.
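
The contrast drawn in the abstract between Hamming distance on single-valued vectors and an intersection-based similarity on set-valued vectors can be illustrated with a minimal sketch. The abstract does not give the exact form of the paper's intersection coefficient, so the per-dimension ratio used here (intersection size over union size, i.e., a Jaccard-style ratio) and the averaging across dimensions are assumptions made for illustration only; the function names are likewise hypothetical.

```python
from typing import FrozenSet, Sequence

# A set-valued vector: one set of non-ordered discrete values per dimension.
SetVector = Sequence[FrozenSet[str]]


def hamming_distance(u: Sequence[str], v: Sequence[str]) -> int:
    """Count the dimensions in which two single-valued vectors differ."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same number of dimensions")
    return sum(1 for a, b in zip(u, v) if a != b)


def dimension_similarity(a: FrozenSet[str], b: FrozenSet[str]) -> float:
    """Intersection size over union size for one dimension.

    ASSUMPTION: a Jaccard-style ratio stands in for the paper's
    intersection coefficient, whose definition is not given in the abstract.
    """
    union = a | b
    if not union:
        # Assumption: two empty sets are treated as identical.
        return 1.0
    return len(a & b) / len(union)


def vector_similarity(u: SetVector, v: SetVector) -> float:
    """Average the per-dimension similarity (an illustrative choice)."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same number of dimensions")
    if not u:
        raise ValueError("vectors must have at least one dimension")
    return sum(dimension_similarity(a, b) for a, b in zip(u, v)) / len(u)


# Single-valued vectors: Hamming distance only sees "equal" or "different".
single_u = ["red", "small"]
single_v = ["red", "large"]
hamming = hamming_distance(single_u, single_v)

# Set-valued vectors: partial overlap per dimension contributes to similarity.
set_u = [frozenset({"red", "blue"}), frozenset({"small"})]
set_v = [frozenset({"blue"}), frozenset({"small", "large"})]
similarity = vector_similarity(set_u, set_v)
```

In this sketch the two set-valued vectors share one of two distinct values in each dimension, so the similarity reflects partial overlap that an equal-or-different comparison cannot express.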

Keywords: clustering; set-valued data; non-ordered discrete data; categorical data; intersection coefficient; clustroids.

DOI: 10.1504/IJDMMM.2009.027288

International Journal of Data Mining, Modelling and Management, 2009 Vol.1 No.3, pp.310-334

Published online: 19 Jul 2009
