Authors: Xiaoxia Wang; Decai Sun; Bo Wu; Puzhao Ji
Addresses: College of Information Science and Technology, Bohai University, Jinzhou, China; College of Information Science and Technology, Bohai University, Jinzhou, China; Dell EMC, No. 2, Hanzhong Road, Nanjing, China; School of Information and Software Engineering, University of Electronic Science and Technology of China, Chengdu, China
Abstract: Similarity join is an essential operation in big data analytics tasks such as data integration and data cleaning. In this paper, we propose a new algorithm, called QSJoin, that supports efficient string similarity joins by reducing the shuffle cost and transmission cost in MapReduce. Our algorithm employs a filter-verify framework. In the filtration phase, a new signature scheme based on q-sample is adopted to decrease the number of generated signatures, and a large number of dissimilar pairs are then discarded with the Standard-Match filter. In the verification phase, a multi-vector filter scheme is adopted to eliminate more dissimilar pairs using statistical features, and the final true pairs are then extracted by verifying the candidate pairs with a length-aware verification method. Experimental results on real-world datasets show that our algorithm achieves high performance and outperforms state-of-the-art approaches.
Keywords: string similarity join; MapReduce; Q-sample; statistical feature; data integration.
International Journal of Arts and Technology, 2019 Vol.11 No.3, pp.285 - 308
Received: 11 Oct 2018
Accepted: 26 Nov 2018
Published online: 25 Mar 2019
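
The filter-verify framework mentioned in the abstract can be illustrated with a minimal, single-machine sketch. This is not QSJoin: the abstract does not define the q-sample signature scheme, the Standard-Match filter, or the multi-vector filter, so the sketch substitutes two well-known filters for edit-distance joins (a length filter and a q-gram count filter) and verifies the survivors with plain dynamic-programming edit distance. The function names, the threshold parameter `tau`, and the default `q=2` are assumptions made for illustration, and the all-pairs loop stands in for the MapReduce shuffle.

```python
from collections import Counter
from itertools import combinations


def qgrams(s, q):
    """Return the bag (multiset) of q-grams of s."""
    return Counter(s[i:i + q] for i in range(len(s) - q + 1))


def edit_distance(a, b):
    """Classic dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution or match
        prev = cur
    return prev[-1]


def similarity_join(strings, tau, q=2):
    """Return all pairs (s, t) with edit_distance(s, t) <= tau.

    Illustrative stand-in filters (assumptions, not the paper's filters):
      * length filter: the lengths may differ by at most tau;
      * count filter: strings within tau edits share at least
        max(|s|, |t|) - q + 1 - q * tau q-grams.
    """
    grams = [qgrams(s, q) for s in strings]
    result = []
    for i, j in combinations(range(len(strings)), 2):
        s, t = strings[i], strings[j]
        # Filter step 1: cheap length check.
        if abs(len(s) - len(t)) > tau:
            continue
        # Filter step 2: q-gram count check (only useful when the bound is positive).
        needed = max(len(s), len(t)) - q + 1 - q * tau
        if needed > 0:
            shared = sum((grams[i] & grams[j]).values())
            if shared < needed:
                continue
        # Verify step: exact edit-distance computation on surviving candidates.
        if edit_distance(s, t) <= tau:
            result.append((s, t))
    return result
```

For example, `similarity_join(["kitten", "sitten", "mitten", "banana"], tau=1)` keeps the three pairs among "kitten", "sitten", and "mitten", while every pair involving "banana" is discarded by the count filter before any edit distance is computed. Both filters only discard pairs that cannot be within `tau` edits, so the verify step sees fewer candidates without losing true results.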