Title: A novel rough semi-supervised k-means algorithm for text clustering

Authors: Lei-yu Tang; Zhen-hao Wang; Shu-dong Wang; Jian-cong Fan; Guo-wei Yue

Addresses: College of Computer Science and Engineering, Shandong University of Science and Technology, Qingdao, China; College of Computer Science and Engineering, Shandong University of Science and Technology, Qingdao, China; College of Computer Science and Technology, China University of Petroleum, Qingdao, China; College of Computer Science and Engineering, Shandong University of Science and Technology, Qingdao, China; Key Laboratory of Mining Disaster Prevention and Control, Shandong University of Science and Technology, Qingdao 266590, China

Abstract: Because many attribute values in high-dimensional sparse data are zero, we combine the approximation sets of rough set theory with the semi-supervised k-means algorithm and propose a rough set-based semi-supervised k-means (RSKmeans) algorithm. First, the proportion of non-zero values is calculated from a few labelled data samples, and a small number of important attributes in each cluster are selected to calculate the clustering centres. Second, the approximation set is used to calculate the information gain of each attribute. Third, attribute values are partitioned into the corresponding approximation sets by comparing their information gain with the upper approximation and boundary threshold. New attributes are then added and the process is repeated to update the clustering centres. Experimental results on text data show that the RSKmeans algorithm helps find the important attributes, filters out invalid information, and significantly improves clustering performance.
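
The first step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the `top_m` cut-off, and the rule "keep the attributes with the highest non-zero proportion" are assumptions made for illustration, since the abstract does not state how the important attributes are chosen. The rough-set approximation and information-gain steps are not covered here.

```python
from collections import defaultdict


def nonzero_proportions(samples, labels):
    """Per cluster label, the fraction of labelled samples in which
    each attribute is non-zero."""
    groups = defaultdict(list)
    for row, label in zip(samples, labels):
        groups[label].append(row)
    return {
        label: [sum(1 for v in col if v != 0) / len(rows) for col in zip(*rows)]
        for label, rows in groups.items()
    }


def select_attributes(proportions, top_m):
    """Assumed selection rule: keep the top_m attributes with the highest
    non-zero proportion in each cluster (ties broken by attribute index)."""
    selected = {}
    for label, props in proportions.items():
        ranked = sorted(range(len(props)), key=lambda j: (-props[j], j))
        selected[label] = sorted(ranked[:top_m])
    return selected


def initial_centres(samples, labels, selected):
    """Cluster centres computed as the mean of the labelled samples,
    restricted to the selected attributes; all other attributes stay 0."""
    groups = defaultdict(list)
    for row, label in zip(samples, labels):
        groups[label].append(row)
    centres = {}
    for label, rows in groups.items():
        centre = [0.0] * len(rows[0])
        for j in selected[label]:
            centre[j] = sum(r[j] for r in rows) / len(rows)
        centres[label] = centre
    return centres


# Toy term-count vectors (made up for illustration): 4 labelled documents, 4 terms.
samples = [
    [2, 0, 0, 1],
    [4, 0, 1, 0],
    [0, 3, 0, 0],
    [0, 5, 2, 0],
]
labels = [0, 0, 1, 1]

props = nonzero_proportions(samples, labels)
chosen = select_attributes(props, top_m=2)
centres = initial_centres(samples, labels, chosen)
print(chosen)   # {0: [0, 2], 1: [1, 2]}
print(centres)  # {0: [3.0, 0.0, 0.5, 0.0], 1: [0.0, 4.0, 1.0, 0.0]}
```

In this toy example each centre is non-zero only on the attributes that are frequently non-zero within its cluster, which reflects the motivation of working with a small number of important attributes in sparse data.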

Keywords: rough set; approximation set; k-means algorithm; semi-supervised clustering; high dimensional sparse data.

DOI: 10.1504/IJBIC.2023.130548

International Journal of Bio-Inspired Computation, 2023 Vol.21 No.2, pp.57 - 68

Received: 25 May 2021
Accepted: 29 Jan 2022

Published online: 27 Apr 2023
