Title: Improved LSH-driven string similarity join filtering-verification framework

Authors: Jingwei Zhang; Ru Chen; Qing Yang

Addresses: Guangxi Key Laboratory of Trusted Software, Guilin University of Electronic Technology, Guilin 541004, China; Guangxi Key Laboratory of Trusted Software, Guilin University of Electronic Technology, Guilin 541004, China; Guangxi Key Laboratory of Automatic Measurement Technology and Instrument, Guilin University of Electronic Technology, Guilin 541004, China

Abstract: Similarity join is a basic data analysis operation that is widely used in similarity search, data cleaning, and recommendation applications. The filtering-verification framework is one of the main approaches to implementing similarity join. To handle high-dimensional data and high edit distance thresholds, a filtering-verification framework based on locality-sensitive hashing (LSH) is proposed. It adopts a dual filtering mode to balance the numbers of false positives and false negatives, thereby improving the efficiency and accuracy of similarity join. Experimental results show that the LSH-based framework effectively reduces the number of false positives and is significantly more efficient than the traditional method based on edit distance.
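
The filtering-verification idea summarised in the abstract can be illustrated with a minimal sketch. The abstract does not say which hash family or which two filters the authors use, so everything below is an assumption chosen for illustration: MinHash over padded q-grams with banding as the LSH filter, a length filter as the second filter, and exact edit distance as the verification step. The function names and parameters (`q`, `bands`, `rows`, `tau`) are hypothetical and not taken from the paper.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def qgrams(s, q=2):
    """Return the set of q-grams of s, padded so short strings still yield grams."""
    padded = "#" * (q - 1) + s + "$" * (q - 1)
    return {padded[i:i + q] for i in range(len(padded) - q + 1)}


def minhash_signature(grams, num_hashes):
    """MinHash signature: for each seeded hash, keep the minimum over all grams."""
    sig = []
    for seed in range(num_hashes):
        sig.append(min(
            int.from_bytes(
                hashlib.blake2b(f"{seed}:{g}".encode(), digest_size=8).digest(),
                "big",
            )
            for g in grams
        ))
    return sig


def edit_distance(a, b):
    """Classic dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def similarity_join(strings, tau, q=2, bands=8, rows=2):
    """Self-join: return index pairs (i, j), i < j, with edit distance <= tau."""
    # Filter 1 (assumed LSH scheme): strings sharing any band bucket are candidates.
    buckets = defaultdict(list)
    for idx, s in enumerate(strings):
        sig = minhash_signature(qgrams(s, q), bands * rows)
        for b in range(bands):
            buckets[(b, tuple(sig[b * rows:(b + 1) * rows]))].append(idx)
    candidates = set()
    for members in buckets.values():
        candidates.update(combinations(sorted(members), 2))

    # Filter 2 (assumed length filter): edit distance >= difference in lengths.
    candidates = {(i, j) for i, j in candidates
                  if abs(len(strings[i]) - len(strings[j])) <= tau}

    # Verification: exact edit distance removes the remaining false positives.
    return sorted((i, j) for i, j in candidates
                  if edit_distance(strings[i], strings[j]) <= tau)
```

In this sketch the verification step guarantees that no false positive is returned, while the LSH filter may still miss some similar pairs (false negatives). Increasing `bands` or lowering `rows` makes the filter more permissive, trading more verification work for fewer misses, which is the kind of balance the abstract refers to.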

Keywords: similarity join; filtering verification.

DOI: 10.1504/IJIITC.2020.110280

International Journal of Intelligent Internet of Things Computing, 2020 Vol.1 No.2, pp.89 - 99

Received: 27 Nov 2019
Accepted: 03 Dec 2019

Published online: 25 Sep 2020
