Title: The heavy frequency vector-based text clustering

Authors: Jun-Peng Bao, Jun-Yi Shen, Xiao-Dong Liu, Hai-Yan Liu

Addresses: Department of Computer Science and Engineering, Xi'an Jiaotong University, China (all authors)

Abstract: The vector space model (VSM) with TF-IDF weighting is a popular way to represent a document. However, it is poorly suited to clustering in a dynamic or changing corpus, because adding a new file to the corpus requires updating the TF-IDF value of every dimension of every VSM vector. Furthermore, popular feature selection methods, such as document frequency (DF), information gain (IG) and chi-square (chi), need global corpus information before clustering. We present the heavy frequency vector (HFV), which considers only the most frequent words in a document. Since an HFV does not contain any global corpus information, it makes incremental clustering easy to implement, especially in a dynamic or changing corpus. We compare the HFV-based K-means model with the traditional VSM-based K-means model under different feature selection methods. The results show that the HFV model achieves better precision than the others, although its complexity is also greater.
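
As a rough illustration of the idea, the sketch below builds a vector from only the most frequent words of a single document, so nothing has to be recomputed when another document joins the corpus. The abstract does not give the exact HFV definition, so the cutoff `k`, the tiny stop-word list, the tokenizer, and the `overlap_similarity` measure are all assumptions made for this example and are not the authors' method.

```python
import re
from collections import Counter

# Assumed, illustrative stop-word list; the abstract does not say how
# function words are handled.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is"}


def heavy_frequency_vector(text, k=5):
    """Return the k most frequent words of one document with their counts.

    Only the document itself is consulted: no corpus-wide statistics
    (such as IDF) are needed, so existing vectors never need updating.
    """
    words = [w for w in re.findall(r"[a-z]+", text.lower())
             if w not in STOP_WORDS]
    counts = Counter(words)
    # Sort by count (descending), then alphabetically so ties are deterministic.
    top = sorted(counts.items(), key=lambda item: (-item[1], item[0]))[:k]
    return dict(top)


def overlap_similarity(hfv_a, hfv_b):
    """Hypothetical similarity: shared frequency mass over the smaller vector."""
    if not hfv_a or not hfv_b:
        return 0.0
    shared = set(hfv_a) & set(hfv_b)
    numerator = sum(min(hfv_a[w], hfv_b[w]) for w in shared)
    denominator = min(sum(hfv_a.values()), sum(hfv_b.values()))
    return numerator / denominator


doc_a = heavy_frequency_vector("apple banana apple cherry banana apple", k=2)
doc_b = heavy_frequency_vector("banana banana durian", k=2)
print(doc_a)  # {'apple': 3, 'banana': 2}
print(overlap_similarity(doc_a, doc_b))
```

Because each vector depends on its own document alone, a new document can be vectorized and assigned to the nearest cluster without touching the vectors already computed, which is the property the abstract highlights for incremental clustering.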

Keywords: text clustering; feature selection; heavy frequency vector; K-means; word frequency; document representation; incremental clustering; text processing; dynamic text; changing text.

DOI: 10.1504/IJBIDM.2005.007317

International Journal of Business Intelligence and Data Mining, 2005 Vol.1 No.1, pp.42 - 53

Published online: 05 Jul 2005
