Authors: P. Krishnakumari, K. Vivekanandan
Addresses: Department of Computer Science, Sri Ramakrishna College of Arts and Science for Women, Coimbatore-641044, Tamilnadu, India; School of Management, Bharathiar University, Coimbatore-641044, Tamilnadu, India
Abstract: Clustering in high-dimensional spaces is a difficult problem that recurs in many domains, e.g., computational biology. Developing effective clustering methods for high-dimensional datasets is challenging because of the curse of dimensionality. This paper presents an efficient, scalable clustering algorithm for high-dimensional data that combines linear discriminant analysis (LDA), applied after PCA feature extraction, with the K-means algorithm to select the most discriminative subspace. K-means clustering is first used to generate class labels, and LDA is then used to select the subspace with the highest variance. The algorithm is designed to reduce the sum of squared errors of the partitions as much as possible while keeping the partitions as far apart as possible. The clustering process is thus integrated with LDA-based subspace selection, so that the data are clustered and the feature subspaces are selected simultaneously. Finally, the clustering instances are aggregated into final clusters by agglomerative clustering. For medical data, all dimensions are necessary, and the proposed method covers all of them efficiently. Experiments on real datasets show that the proposed method outperforms existing methods for clustering high-dimensional genomic data in terms of accuracy.
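The alternation between K-means labelling and LDA subspace selection described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the farthest-point initialisation, the default of 10 principal components, the choice of k - 1 discriminant directions, and the stopping rule are all assumptions made for illustration, and the final agglomerative aggregation step is omitted.

```python
import numpy as np


def pca_reduce(X, n_components):
    """Project centered data onto its top principal components (via SVD)."""
    Xc = X - X.mean(axis=0)
    _, _, vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ vt[:n_components].T


def kmeans(Z, k, rng, n_iter=50):
    """Lloyd's K-means with farthest-point initialisation (an assumption)."""
    centers = [Z[rng.integers(len(Z))]]
    for _ in range(k - 1):
        # Squared distance of every point to its nearest chosen center.
        d2 = np.min([((Z - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(Z[d2.argmax()])
    centers = np.array(centers, dtype=float)
    labels = np.zeros(len(Z), dtype=int)
    for _ in range(n_iter):
        dists = ((Z[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):  # keep the old center if a cluster empties
                centers[j] = Z[labels == j].mean(axis=0)
    return labels


def lda_directions(Y, labels, n_dims):
    """Leading eigenvectors of pinv(Sw) @ Sb for the current pseudo-labels."""
    d = Y.shape[1]
    mean = Y.mean(axis=0)
    Sw = np.zeros((d, d))  # within-cluster scatter
    Sb = np.zeros((d, d))  # between-cluster scatter
    for c in np.unique(labels):
        Yc = Y[labels == c]
        mc = Yc.mean(axis=0)
        Sw += (Yc - mc).T @ (Yc - mc)
        diff = (mc - mean)[:, None]
        Sb += len(Yc) * (diff @ diff.T)
    evals, evecs = np.linalg.eig(np.linalg.pinv(Sw) @ Sb)
    order = np.argsort(-evals.real)[:n_dims]
    return evecs[:, order].real


def same_partition(a, b):
    """True if two labelings group the points identically (up to renaming)."""
    return np.array_equal(a[:, None] == a[None, :], b[:, None] == b[None, :])


def lda_kmeans(X, k, n_pca=10, n_rounds=10, seed=0):
    """Alternate K-means labelling and LDA subspace selection after PCA."""
    rng = np.random.default_rng(seed)
    Y = pca_reduce(X, min(n_pca, X.shape[1]))
    labels = kmeans(Y, k, rng)
    for _ in range(n_rounds):
        W = lda_directions(Y, labels, k - 1)  # subspace from current labels
        new_labels = kmeans(Y @ W, k, rng)    # re-cluster inside that subspace
        if same_partition(new_labels, labels):
            break
        labels = new_labels
    return labels
```

Calling `lda_kmeans(X, k=3)` on an array `X` of shape `(n_samples, n_features)` returns one integer label per row. Using K-means labels as stand-ins for class labels is what lets LDA, normally a supervised method, drive the subspace selection here.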
Keywords: gene expression; clustering accuracy; DNA microarrays; k-means clustering; linear discriminant analysis; LDA; principal component analysis; PCA.
International Journal of Rapid Manufacturing, 2009 Vol.1 No.2, pp.222 - 236
Published online: 28 Nov 2009