
Title: Semi supervised ensemble clustering algorithm for high dimensional genomic data

Authors: P. Krishnakumari, K. Vivekanandan

Addresses: Department of Computer Science, Sri Ramakrishna College of Arts and Science for Women, Coimbatore-641044, Tamilnadu, India; School of Management, Bharathiar University, Coimbatore-641044, Tamilnadu, India

Abstract: Clustering in high-dimensional spaces is a difficult problem that recurs in many domains, e.g., computational biology. The curse of dimensionality makes it challenging to develop effective clustering methods for high-dimensional datasets. This paper presents an efficient, scalable clustering algorithm for high-dimensional data that combines linear discriminant analysis (LDA), applied to features extracted by PCA, with the K-means algorithm to select the most discriminative subspace. Initially, K-means clustering generates class labels, and LDA then selects the subspace in the direction of highest variance. The algorithm is designed to reduce the sum of squared errors within the partitions as much as possible while keeping the partitions as far apart as possible. The clustering process is thus integrated with LDA-based subspace selection: the data are clustered and the feature subspaces are selected simultaneously. Finally, the clustering instances are aggregated into final clusters by agglomerative clustering. For medical data, all the dimensions are necessary, and the proposed method covers all of them efficiently. Results on real datasets show that the proposed method outperforms existing methods for clustering high-dimensional genomic data in terms of accuracy.
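
The pipeline described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not specify how the clustering instances are generated or aggregated, so the use of different K-means seeds, a co-association matrix, and average linkage are all assumptions made here for illustration. The function names `lda_kmeans` and `ensemble_cluster` and the parameters `n_pca`, `n_iter`, and `n_runs` are hypothetical. The sketch relies on NumPy, SciPy, and scikit-learn.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import adjusted_rand_score


def lda_kmeans(X, k, n_pca=10, n_iter=10, seed=0):
    """One clustering instance: PCA, then alternate K-means and LDA."""
    Z = PCA(n_components=n_pca, svd_solver="full").fit_transform(X)
    labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(Z)
    for _ in range(n_iter):
        n_classes = len(np.unique(labels))
        if n_classes < 2:
            break  # LDA needs at least two classes
        # Treat the current K-means labels as class labels for LDA.
        lda = LinearDiscriminantAnalysis(n_components=min(n_classes - 1, n_pca))
        Y = lda.fit_transform(Z, labels)
        # Re-cluster in the discriminative subspace chosen by LDA.
        new_labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(Y)
        if adjusted_rand_score(labels, new_labels) == 1.0:
            break  # the partition stopped changing
        labels = new_labels
    return labels


def ensemble_cluster(X, k, n_pca=10, n_runs=5):
    """Aggregate several instances (assumed: co-association + average linkage)."""
    n = X.shape[0]
    co = np.zeros((n, n))
    for seed in range(n_runs):
        labels = lda_kmeans(X, k, n_pca=n_pca, seed=seed)
        # Count how often each pair of samples lands in the same cluster.
        co += labels[:, None] == labels[None, :]
    dist = 1.0 - co / n_runs  # pairs that always co-cluster get distance 0
    tree = linkage(squareform(dist, checks=False), method="average")
    return fcluster(tree, t=k, criterion="maxclust")


# Synthetic stand-in for expression data: 60 samples, 200 features, 3 groups.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 200))
X[20:40, :10] += 5.0
X[40:, 10:20] += 5.0
final = ensemble_cluster(X, k=3)
print(np.bincount(final)[1:])  # cluster sizes (fcluster labels start at 1)
```

The alternation in `lda_kmeans` mirrors the abstract's description of clustering and subspace selection happening together: K-means supplies labels, LDA uses them to pick a subspace, and K-means is rerun in that subspace until the partition is stable or the iteration budget is spent.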

Keywords: gene expression; clustering accuracy; DNA microarrays; k-means clustering; linear discriminant analysis; LDA; principal component analysis; PCA.

DOI: 10.1504/IJRAPIDM.2009.029384

International Journal of Rapid Manufacturing, 2009 Vol.1 No.2, pp.222 - 236

Published online: 28 Nov 2009
