The heavy frequency vector-based text clustering
by Jun-Peng Bao, Jun-Yi Shen, Xiao-Dong Liu, Hai-Yan Liu
International Journal of Business Intelligence and Data Mining (IJBIDM), Vol. 1, No. 1, 2005

Abstract: The vector space model (VSM) with TF-IDF weighting is a popular way to represent a document. However, it is poorly suited to clustering a dynamic or changing corpus, because the TF-IDF value of every dimension of every VSM vector must be updated whenever a new file is added to the corpus. Furthermore, popular feature selection methods, such as document frequency (DF), information gain (IG) and the chi-square statistic (chi), need global corpus information before clustering. We present the heavy frequency vector (HFV), which considers only the most frequent words in a document. Since an HFV does not contain any global corpus information, incremental clustering is easy to implement, especially in a dynamic or changing corpus. We compare the HFV-based K-means model with the traditional VSM-based K-means model under different feature selection methods. The results show that the HFV model achieves better precision than the others. However, the complexity of the HFV model is also greater than that of the others.
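The idea of a per-document vector built only from that document's most frequent words can be illustrated with a minimal Python sketch. The abstract does not give the exact definition, so everything specific here is an assumption: the function names `heavy_frequency_vector` and `cosine`, the cutoff `k`, the regular-expression tokenizer, the use of raw counts as weights, and cosine similarity as the comparison measure are illustrative choices, not the authors' method.

```python
import math
import re
from collections import Counter


def heavy_frequency_vector(text, k=5):
    """Return the k most frequent words of one document with their counts.

    Assumption: "heavy" is taken to mean top-k by raw count. The cutoff k
    and the tokenizer are illustrative, not taken from the paper.
    """
    words = re.findall(r"[a-z]+", text.lower())
    return dict(Counter(words).most_common(k))


def cosine(u, v):
    """Cosine similarity between two sparse word-count vectors (dicts)."""
    dot = sum(count * v.get(word, 0) for word, count in u.items())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


docs = [
    "cluster cluster cluster text text vector",
    "text cluster cluster vector vector vector",
    "market market price price price stock",
]
vectors = [heavy_frequency_vector(d, k=3) for d in docs]

# Adding a document only requires building its own vector from its own
# text; the vectors already computed are left untouched.
new_vector = heavy_frequency_vector("price stock stock stock market", k=3)
vectors.append(new_vector)

for i, vec in enumerate(vectors):
    print(i, vec, round(cosine(vec, new_vector), 3))
```

Because each vector depends only on its own document, appending a file does not change any earlier vector. A TF-IDF representation, by contrast, would need its IDF terms recomputed across the corpus, which is the update cost the abstract describes.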

Online publication date: Tue, 05-Jul-2005
