QSJoin: a new string similarity join method based on Q-sample and statistical features
Online publication date: Fri, 28-Jun-2019
by Xiaoxia Wang; Decai Sun; Bo Wu; Puzhao Ji
International Journal of Arts and Technology (IJART), Vol. 11, No. 3, 2019
Abstract: Similarity join is an essential operation in big data analytics tasks such as data integration and data cleaning. In this paper, we propose a new algorithm, called QSJoin, to support efficient string similarity joins by reducing the shuffle cost and transmission cost in MapReduce. Our algorithm employs a filter-verify framework. In filtration, a new signature scheme based on q-samples is adopted to decrease the number of generated signatures, and a large number of dissimilar pairs are then discarded with a Standard-Match filter. In verification, a multi-vector filter scheme is adopted to eliminate more dissimilar pairs using statistical features, and the final true pairs are then extracted by verifying the remaining candidate pairs with a length-aware verification method. Experimental results on real-world datasets show that our algorithm achieves high performance and outperforms state-of-the-art approaches.
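The filter-verify idea summarised in the abstract can be illustrated with a minimal single-machine sketch in Python. This is not the QSJoin algorithm itself: the abstract does not specify the Standard-Match filter, the multi-vector filter or the length-aware verification, and nothing here runs on MapReduce. The sketch assumes edit distance as the similarity measure and a threshold tau, and all function names (q_samples, passes_filter, similarity_join) are hypothetical. It uses a generic counting argument: the q-samples of a string are disjoint, so each edit operation can destroy at most one of them, and a pair within distance tau must therefore share all but at most tau samples.

```python
from itertools import combinations


def q_samples(s, q):
    """Non-overlapping q-grams taken at positions 0, q, 2q, ... (assumed definition)."""
    return [s[i:i + q] for i in range(0, len(s) - q + 1, q)]


def edit_distance(a, b):
    """Plain dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # match or substitute
        prev = cur
    return prev[-1]


def passes_filter(r, s, q, tau):
    """Cheap necessary conditions for edit_distance(r, s) <= tau."""
    # Length filter: strings whose lengths differ by more than tau cannot match.
    if abs(len(r) - len(s)) > tau:
        return False
    # Sample-count filter: at most tau of the disjoint samples of s can be
    # destroyed by tau edits, so the rest must occur somewhere in r.
    samples = q_samples(s, q)
    matched = sum(1 for g in samples if g in r)
    return matched >= len(samples) - tau


def similarity_join(strings, q=2, tau=1):
    """Filter step first, then exact verification of the surviving candidates."""
    result = []
    for r, s in combinations(strings, 2):
        if passes_filter(r, s, q, tau) and edit_distance(r, s) <= tau:
            result.append((r, s))
    return result


if __name__ == "__main__":
    words = ["similarity", "similarly", "join", "joins", "filter"]
    print(similarity_join(words, q=2, tau=1))
```

The filter only discards pairs that cannot be within the threshold, so the verification step still decides the final result; a distributed version would additionally have to choose how signatures are keyed and shuffled, which this sketch does not attempt.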