Authors: Huang-Cheng Kuo, Tsung-Lung Lee, Jen-Peng Huang
Addresses: Department of Computer Science and Information Engineering, National Chiayi University, Chia-Yi City 600, Taiwan; Department of Computer Science and Information Engineering, National Chiayi University, Chia-Yi City 600, Taiwan; Department of Information Management, Southern Taiwan University, Tainan County 710, Taiwan
Abstract: Cluster analysis is frequently used to study trends in gene expression behaviour from microarray time series data. We adopt a partitioning-based clustering algorithm for this task. After the time series are discretised into sequences, a sequential pattern mining technique is applied to find patterns that serve as the initial clusters. Longest Common Subseries Similarity, which overcomes the 'shift-effect' influence, is used to measure the similarity between time series. An object is re-assigned to the cluster that has the most objects among the object's k nearest neighbours. Similarity measurements, such as the Pearson correlation coefficient, are used to determine the neighbours.
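Illustration: the re-assignment step described in the abstract can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names pearson and reassign are hypothetical, neighbours are ranked by the Pearson correlation coefficient only (the Longest Common Subseries Similarity is not reproduced here because its definition is not given in this record), all objects are updated in one synchronous pass, and ties in the neighbour vote are broken arbitrarily.

```python
from collections import Counter
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0  # assumption: treat a constant series as uncorrelated
    return cov / (sx * sy)


def reassign(series, labels, k):
    """One pass: give each object the majority label of its k nearest neighbours.

    Neighbours are voted on using the labels from before the pass
    (synchronous update, an assumption of this sketch).
    """
    new_labels = list(labels)
    for i, s in enumerate(series):
        # Rank all other objects by decreasing similarity to object i.
        others = sorted(
            (j for j in range(len(series)) if j != i),
            key=lambda j: pearson(s, series[j]),
            reverse=True,
        )
        votes = Counter(labels[j] for j in others[:k])
        new_labels[i] = votes.most_common(1)[0][0]
    return new_labels
```

In a full partitioning-based procedure, a pass like this would presumably be repeated until the labels stop changing, starting from the initial clusters found by sequential pattern mining; that outer loop and the stopping rule are assumptions here rather than details stated in the abstract.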
Keywords: gene expression; time series; cluster analysis; similarity measurement; k-nearest neighbours; k-NN; sequential pattern mining; microarrays; partitioning-based clustering algorithms; initial clusters; longest common subseries similarity; shift-effect; Pearson correlation coefficient; data mining; business intelligence.
International Journal of Business Intelligence and Data Mining, 2010 Vol.5 No.1, pp.56 - 76
Published online: 14 Dec 2009