Title: An efficient length-segmented inverted index-based set similarity query algorithm

Authors: Mengjuan Li; Lianyin Jia; Juntao Hu; Ruiqi Zhang; Shoulin Wei; Mengni Pan

Addresses: Library, Yunnan Normal University, Kunming, 650500, China ' Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming, 650500, China ' Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming, 650500, China ' Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming, 650500, China ' Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming, 650500, China ' Foreign Language Department, Kunming Medical University, Kunming, 650500, China

Abstract: Set Similarity query is extensively used in recommendation systems, near duplicate detection and many other fields. Existing approaches usually scan all relevant inverted lists or trie indexes, thus leading to low efficiency. To solve this problem, in this paper, an efficient length segmentation-based inverted index, LenSegII, is designed. LenSegII imposes a length segmentation table (LST) on inverted indexes by exploiting length filtering, thus logically dividing an inverted index into segments. Based on LenSegII, LenSC-SSQ, an efficient set similarity query algorithm is proposed. LenSC-SSQ only needs to access a certain range of inverted index segments and locates the segments efficiently by LST, thus a large number of unqualified sets can be filtered, which significantly decreases query time. Besides, LenSC-SSQ can improve space efficiency by using a more compact count array. Experiments on 2 real and 1 synthetic datasets show LenSC-SSQ outperforms the other competitors significantly and can reach a performance improvement up to 5.8X over ScanCount.

Keywords: set similarity query; length segmentation; LenSegII; LenSC-SSQ; inverted index; ScanCount; length filtering.

DOI: 10.1504/IJCSM.2022.126796

International Journal of Computing Science and Mathematics, 2022 Vol.16 No.1, pp.85 - 95

Received: 24 Nov 2021
Received in revised form: 27 Apr 2022
Accepted: 16 Jul 2022

Published online: 07 Nov 2022 *

Full-text access for editors Full-text access for subscribers Purchase this article Comment on this article