Open Access Article

Title: A distributed two-stage clustering method based on node sampling

Authors: Baolong Zhang; Haiyan Huang

Addresses: Office of Development Planning and Quality Evaluation, Jiyuan Vocational and Technical College, Jiyuan, 459000, China; School of Artificial Intelligence, Jiyuan Vocational and Technical College, Jiyuan, 459000, China

Abstract: To address the high computational resource consumption and low clustering efficiency of big data clustering, this paper first proposes an improved density deviation sampling algorithm (EDDS). Each cluster node then independently clusters its own subset of the big data to generate initial local clustering results. Next, the EDDS algorithm is applied on each node to extract representative data subsets, which are aggregated into a sample set that reflects the characteristics of the entire dataset. Finally, this sample set is clustered again, and the result is used to integrate the local clustering information from each node into a comprehensive clustering result for the entire dataset. Experimental results demonstrate that, compared with traditional clustering methods, the proposed approach effectively combines the efficiency of parallel processing with the accuracy of integrated analysis.
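The two-stage flow described in the abstract can be illustrated with a minimal, single-process Python sketch. It is not the authors' implementation, and several parts are assumptions: the abstract does not specify EDDS, so `sample_cluster` is a hypothetical placeholder that draws a uniform random sample from each local cluster; a plain k-means with farthest-first initialisation stands in for whatever clustering algorithm the paper uses; and the final integration step is assumed to be a majority vote of each local cluster's sampled points. All function and parameter names are illustrative.

```python
import math
import random
from collections import Counter


def nearest(p, centroids):
    """Index of the centroid closest to point p."""
    return min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))


def kmeans(points, k, iters=20):
    """Plain Lloyd's k-means with farthest-first initialisation (assumed stand-in)."""
    centroids = [points[0]]
    while len(centroids) < k:
        # Next seed: the point farthest from all centroids chosen so far.
        centroids.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        )
    for _ in range(iters):
        labels = [nearest(p, centroids) for p in points]
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return [nearest(p, centroids) for p in points]


def sample_cluster(members, fraction, rng):
    """Hypothetical placeholder for EDDS: uniform sample of one local cluster."""
    n = max(1, round(fraction * len(members)))
    return rng.sample(members, n)


def two_stage_cluster(node_data, k_local, k_global, fraction=0.2, seed=0):
    """Return one list of global labels per node, aligned with node_data."""
    rng = random.Random(seed)
    local_labels, sample, origin = [], [], []

    # Stage 1: every node clusters its own subset, then contributes a sample.
    for node_id, points in enumerate(node_data):
        labels = kmeans(points, k_local)
        local_labels.append(labels)
        for c in range(k_local):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if not members:
                continue
            picked = sample_cluster(members, fraction, rng)
            sample.extend(picked)
            origin.extend([(node_id, c)] * len(picked))

    # Stage 2: cluster the pooled sample set.
    global_labels = kmeans(sample, k_global)

    # Integration (assumed): each local cluster takes the global label
    # held by the majority of its sampled points.
    votes = {}
    for key, g in zip(origin, global_labels):
        votes.setdefault(key, Counter())[g] += 1
    mapping = {key: cnt.most_common(1)[0][0] for key, cnt in votes.items()}

    return [
        [mapping[(node_id, lab)] for lab in labels]
        for node_id, labels in enumerate(local_labels)
    ]
```

In this sketch, `node_data` is a list with one entry per node, each entry being a list of coordinate tuples. In a real deployment the first loop would run in parallel on separate machines, with only the sampled points and their `(node_id, local_cluster)` tags sent to the coordinator, which is where the savings in data transfer and computation would come from.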

Keywords: big data clustering; distributed computing; density deviation sampling; node sampling; two-stage clustering.

DOI: 10.1504/IJICT.2025.148824

International Journal of Information and Communication Technology, 2025 Vol.26 No.34, pp.100 - 115

Received: 10 Jun 2025
Accepted: 27 Jun 2025

Published online: 26 Sep 2025