Title: QSJoin: a new string similarity join method based on Q-sample and statistical features

Authors: Xiaoxia Wang; Decai Sun; Bo Wu; Puzhao Ji

Addresses: College of Information Science and Technology, Bohai University, Jinzhou, China; College of Information Science and Technology, Bohai University, Jinzhou, China; Dell EMC, No. 2, Hanzhong Road, Nanjing, China; School of Information and Software Engineering, University of Electronic Science and Technology of China, Chengdu, China

Abstract: Similarity join is an essential operation in big data analytics tasks such as data integration and data cleaning. In this paper, we propose a new algorithm, called QSJoin, that supports efficient string similarity join by reducing the shuffle and transmission costs in MapReduce. Our algorithm employs a filter-verify framework. In the filtration phase, a new signature scheme based on q-samples is adopted to decrease the number of generated signatures, and a large number of dissimilar pairs are then discarded with the Standard-Match filter. In the verification phase, a multi-vector filter scheme is adopted to eliminate more dissimilar pairs using statistical features, and the final true pairs are then extracted by verifying the candidate pairs with a length-aware verification method. Experimental results on real-world datasets show that our algorithm achieves high performance and outperforms state-of-the-art approaches.
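
The filter-verify idea described in the abstract can be illustrated with a minimal single-machine sketch. This is not the authors' MapReduce implementation: the function names, the edit-distance threshold `tau`, and the parameter `q` are assumptions made for illustration, and the paper's Standard-Match and multi-vector filters are not reproduced. A character-frequency filter stands in for a statistical-feature filter, and a length check plus early-terminating dynamic programming stands in for length-aware verification. The signature step relies on a pigeonhole argument: `tau` edits can damage at most `tau` of `tau + 1` disjoint q-grams, so at least one survives intact in any string within distance `tau`.

```python
from collections import Counter, defaultdict


def q_samples(s, q):
    """Non-overlapping q-grams of s taken at positions 0, q, 2q, ..."""
    return [s[i:i + q] for i in range(0, len(s) - q + 1, q)]


def q_grams(s, q):
    """All overlapping q-grams of s."""
    return {s[i:i + q] for i in range(len(s) - q + 1)}


def freq_distance(a, b):
    """L1 distance between character-count vectors.
    A single edit operation changes this distance by at most 2."""
    ca, cb = Counter(a), Counter(b)
    return sum(abs(ca[c] - cb[c]) for c in ca.keys() | cb.keys())


def edit_distance_within(a, b, tau):
    """Return True iff the edit distance between a and b is at most tau."""
    if abs(len(a) - len(b)) > tau:          # length check before any DP work
        return False
    prev = list(range(len(b) + 1))
    for i, ch in enumerate(a, 1):
        cur = [i] + [0] * len(b)
        for j in range(1, len(b) + 1):
            cur[j] = min(prev[j] + 1,                    # deletion
                         cur[j - 1] + 1,                 # insertion
                         prev[j - 1] + (ch != b[j - 1])) # match / substitution
        if min(cur) > tau:                  # row minimum never decreases
            return False
        prev = cur
    return prev[-1] <= tau


def similarity_join(strings, q=2, tau=1):
    """Self-join: return sorted index pairs (i, j), i < j, within distance tau."""
    # Index every overlapping q-gram so a sampled q-gram can match at any offset.
    index = defaultdict(set)
    for i, s in enumerate(strings):
        for g in q_grams(s, q):
            index[g].add(i)

    results = []
    for i, s in enumerate(strings):
        samples = q_samples(s, q)
        if len(samples) >= tau + 1:
            # Filtration: tau + 1 disjoint q-grams suffice as signatures.
            candidates = set()
            for g in samples[:tau + 1]:
                candidates |= index.get(g, set())
        else:
            # Too short for a safe signature filter (assumed fallback).
            candidates = set(range(len(strings)))
        for j in sorted(candidates):
            if j <= i:
                continue
            t = strings[j]
            if abs(len(s) - len(t)) > tau:       # length filter
                continue
            if freq_distance(s, t) > 2 * tau:    # statistical-feature filter
                continue
            if edit_distance_within(s, t, tau):  # verification
                results.append((i, j))
    return sorted(results)


if __name__ == "__main__":
    data = ["similarity", "similarty", "simularity",
            "mapreduce", "mapreduse", "join"]
    print(similarity_join(data, q=2, tau=1))  # [(0, 1), (0, 2), (3, 4)]
```

Probing with only `tau + 1` sampled q-grams per string, rather than every overlapping q-gram, is what keeps the number of signatures small; in a distributed setting, fewer signatures would presumably mean fewer key-value pairs to shuffle.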

Keywords: string similarity join; MapReduce; Q-sample; statistical feature; data integration.

DOI: 10.1504/IJART.2019.100429

International Journal of Arts and Technology, 2019 Vol.11 No.3, pp.285 - 308

Received: 11 Oct 2018
Accepted: 26 Nov 2018

Published online: 28 Jun 2019
