Title: Cluster analysis on time series gene expression data

Authors: Huang-Cheng Kuo, Tsung-Lung Lee, Jen-Peng Huang

Addresses: Department of Computer Science and Information Engineering, National Chiayi University, Chia-Yi City 600, Taiwan; Department of Computer Science and Information Engineering, National Chiayi University, Chia-Yi City 600, Taiwan; Department of Information Management, Southern Taiwan University, Tainan County 710, Taiwan

Abstract: Cluster analysis is frequently used to study trends in gene expression behaviour from microarray time series data. We adopt a partitioning-based clustering algorithm for this task. After the time series are discretised into sequences, a sequential pattern mining technique is applied to find patterns that serve as the initial clusters. Longest Common Subseries Similarity, which overcomes the influence of the 'shift-effect', is used to measure the similarity between time series. An object is re-assigned to the cluster that contains the most objects among the object's k nearest neighbours. Similarity measurements, such as the Pearson correlation coefficient, are used to determine the neighbours.
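
The re-assignment step described in the abstract can be illustrated with a minimal Python sketch. It assumes NumPy, uses the Pearson correlation coefficient as the similarity measure, and updates all labels in one pass from the previous assignment. The function name `reassign_once`, the toy data, and the tie handling are illustrative assumptions, not the authors' implementation. The discretisation, the sequential pattern mining initialisation, and Longest Common Subseries Similarity are not reproduced here.

```python
from collections import Counter

import numpy as np


def reassign_once(series, labels, k):
    """Re-assign each time series to the cluster that is most common
    among its k nearest neighbours (hypothetical helper, one pass)."""
    # Pairwise Pearson correlation between rows (one row per time series).
    sim = np.corrcoef(series)
    # Exclude each series from its own neighbour list.
    np.fill_diagonal(sim, -np.inf)

    new_labels = labels.copy()
    for i in range(len(series)):
        # Indices of the k most similar series, most similar first.
        neighbours = np.argsort(sim[i])[::-1][:k]
        # Majority vote over the neighbours' current cluster labels.
        # Ties go to the label met first (an arbitrary choice here).
        votes = Counter(labels[neighbours].tolist())
        new_labels[i] = votes.most_common(1)[0][0]
    return new_labels


# Toy data: four rising and three falling expression profiles.
series = np.array([
    [1, 2, 3, 4],
    [2, 3, 5, 6],
    [1, 3, 4, 6],
    [0, 2, 3, 5],   # rising, but initially placed in cluster 1
    [4, 3, 2, 1],
    [6, 5, 3, 2],
    [5, 4, 2, 1],
], dtype=float)
labels = np.array([0, 0, 0, 1, 1, 1, 1])

print(reassign_once(series, labels, k=3))
```

In this toy run the fourth series, whose three nearest neighbours all belong to cluster 0, moves to cluster 0. In practice the pass would be repeated until the labels stop changing or an iteration limit is reached.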

Keywords: gene expression; time series; cluster analysis; similarity measurement; k-nearest neighbours; k-NN; sequential pattern mining; microarrays; partitioning-based clustering algorithms; initial clusters; longest common subseries similarity; shift-effect; Pearson correlation coefficient; data mining; business intelligence.

DOI: 10.1504/IJBIDM.2010.030299

International Journal of Business Intelligence and Data Mining, 2010 Vol.5 No.1, pp.56 - 76

Published online: 14 Dec 2009
