Title: Time-sensitive clustering evolving textual data streams

Authors: Mohamed Ammar; Adel Hidri; Minyar Sassi Hidri

Addresses: National Engineering School of Tunis, Tunis El Manar University, Tunis, Tunisia; Computer Department, Deanship of Preparatory Year and Supporting Studies, Imam Abdulrahman Bin Faisal University, Dammam 31441, Saudi Arabia; Computer Department, Deanship of Preparatory Year and Supporting Studies, Imam Abdulrahman Bin Faisal University, Dammam 31441, Saudi Arabia

Abstract: Clustering a stream of text documents is an emerging subject of interest, since it is widely used in analysing content from social media and e-journals. The aim is to find structure in unlabelled data based on a similarity criterion. However, few works have addressed this setting. This paper therefore proposes a new document clustering approach adapted to streams of text data and tests it on news article data sets. A distributed representation of words is used, and a bottom-up approach represents documents as vectors on a unit hypersphere. The proposed approach is rooted in the spherical k-means (SPKM) algorithm and its underlying mixture of von Mises-Fisher (vMF) distributions. It yields results comparable to a baseline batch algorithm for stable data streams and superior results for rapidly evolving data streams.
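
For readers unfamiliar with the family of methods the abstract refers to, the following is a minimal sketch of one online, winner-take-all update in the style of spherical k-means. It is not the authors' algorithm: the function names, the fixed learning_rate, and the toy two-dimensional vectors are all assumptions made for illustration, and the time-sensitive handling of evolving streams described in the paper is not modelled here.

```python
import math

def normalize(v):
    """Scale a vector to unit L2 norm (project it onto the unit hypersphere)."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]

def cosine(u, v):
    """Dot product; equals cosine similarity when both vectors have unit norm."""
    return sum(a * b for a, b in zip(u, v))

def online_spkm_step(centroids, doc, learning_rate=0.1):
    """Assign one document vector to its most similar centroid and update it.

    Hypothetical winner-take-all rule: only the winning centroid moves
    toward the document, then it is re-projected onto the unit hypersphere.
    """
    x = normalize(doc)
    winner = max(range(len(centroids)), key=lambda k: cosine(centroids[k], x))
    moved = [c + learning_rate * xi for c, xi in zip(centroids[winner], x)]
    centroids[winner] = normalize(moved)
    return winner

# Toy stream of 2-D "document vectors" (illustrative values only).
centroids = [normalize([1.0, 0.1]), normalize([0.1, 1.0])]
stream = [[0.9, 0.2], [0.2, 0.8], [1.0, 0.0], [0.0, 1.0]]
labels = [online_spkm_step(centroids, doc) for doc in stream]
print(labels)  # [0, 1, 0, 1]
```

Because each document is processed once and then discarded, a rule of this kind needs memory only for the centroids, which is one reason such updates are a natural fit for streams.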

Keywords: natural language processing; document clustering; competitive learning; data streams; machine learning; metadata; scientific data management; data sharing; data integration; computer supported collaborative work.

DOI: 10.1504/IJCAT.2020.107900

International Journal of Computer Applications in Technology, 2020 Vol.63 No.1/2, pp.25 - 40

Received: 24 May 2019
Accepted: 01 Oct 2019

Published online: 30 Jun 2020
