Title: Research on deduplication method of multiple relations based on hierarchical clustering algorithm

Authors: Ying Wang; Weiwei Cheng; Chang Liu

Addresses: Department of Mathematics and Physics, Harbin Far East Institute of Technology, Harbin 150040, China ' Department of Basic Sciences, Qiqihar Institute of Engineering, Qiqihar 150040, China ' Department of Mathematics and Physics, Harbin Far East Institute of Technology, Harbin 150040, China

Abstract: To overcome the low efficiency and large error of traditional data deduplication methods, a multi-relational data deduplication method based on a hierarchical clustering algorithm is proposed. Using the inter-class relationship information of duplicate data, closely related class clusters of different types are merged. The hierarchical clustering algorithm then clusters all duplicate data according to data similarity. After the similar class is found in the first-level index, the super eigenvalue is used to complete the detection of multi-relational duplicate data. Depending on the situation, the detected duplicate data is deleted by automatic, semi-automatic or manual methods. Experimental results show that the method has a low error rate and a good deletion effect, and that it improves the efficiency of multi-relational data deduplication, with a highest deletion rate of 99%.
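
As a rough illustration of the clustering step described in the abstract, the sketch below groups records by similarity with a simple agglomerative (bottom-up hierarchical) procedure and keeps one record per cluster. It is not the authors' implementation: the token-set Jaccard similarity, the single-linkage rule, the `threshold` value, and the function names `jaccard`, `cluster_duplicates` and `deduplicate` are all assumptions made for this example, and the first-level index and super eigenvalue detection steps are not modelled.

```python
from itertools import combinations


def jaccard(a: str, b: str) -> float:
    """Token-set Jaccard similarity (an assumed stand-in for the paper's measure)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def cluster_duplicates(records, threshold=0.6):
    """Single-linkage agglomerative clustering over record indices.

    Repeatedly merges the two most similar clusters and stops once no
    pair of clusters reaches `threshold` (a hypothetical cut-off).
    """
    clusters = [[i] for i in range(len(records))]

    def linkage(c1, c2):
        # Single linkage: similarity of the closest pair across two clusters.
        return max(jaccard(records[i], records[j]) for i in c1 for j in c2)

    while len(clusters) > 1:
        (x, y), best = max(
            (
                ((p, q), linkage(clusters[p], clusters[q]))
                for p, q in combinations(range(len(clusters)), 2)
            ),
            key=lambda item: item[1],
        )
        if best < threshold:
            break
        # x < y always holds, so deleting index y leaves index x valid.
        clusters[x] = sorted(clusters[x] + clusters[y])
        del clusters[y]
    return clusters


def deduplicate(records, threshold=0.6):
    """Automatic deletion: keep the first record of each cluster, in input order."""
    keep = sorted(c[0] for c in cluster_duplicates(records, threshold))
    return [records[i] for i in keep]
```

Calling `deduplicate(records)` on a list of strings returns the list with near-duplicates removed. A semi-automatic or manual variant could instead return the clusters from `cluster_duplicates` for review before anything is deleted.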

Keywords: hierarchical clustering; multi relational data repetition; super eigenvalue; inter class relationship.

DOI: 10.1504/IJICT.2023.128741

International Journal of Information and Communication Technology, 2023 Vol.22 No.2, pp.105 - 116

Received: 08 Dec 2020
Accepted: 15 Jan 2021

Published online: 02 Feb 2023
