Title: Proposal and study of statistical features for string similarity computation and classification

Authors: E.O. Rodrigues; D. Casanova; M. Teixeira; V. Pegorini; F. Favarim; E. Clua; A. Conci; Panos Liatsis

Addresses: Academic Department of Informatics, Universidade Tecnológica Federal do Paraná (UTFPR), Paraná, Brazil; Academic Department of Informatics, Universidade Tecnológica Federal do Paraná (UTFPR), Paraná, Brazil; Academic Department of Informatics, Universidade Tecnológica Federal do Paraná (UTFPR), Paraná, Brazil; Academic Department of Informatics, Universidade Tecnológica Federal do Paraná (UTFPR), Paraná, Brazil; Academic Department of Informatics, Universidade Tecnológica Federal do Paraná (UTFPR), Paraná, Brazil; Department of Computer Science, Universidade Federal Fluminense (UFF), Rio de Janeiro, Brazil; Department of Computer Science, Universidade Federal Fluminense (UFF), Rio de Janeiro, Brazil; Department of Electrical Engineering and Computer Science, Khalifa University, Abu Dhabi, UAE

Abstract: Two features commonly used in visual computing, the co-occurrence matrix (COM) and the run-length matrix (RLM), are adapted to compute the similarity of strings in general (words, phrases, codes and texts). The proposed features do not depend on language-specific information: they are purely statistical and can be applied in any context, with any language or grammatical structure. Other statistical measures commonly employed in the field, such as longest common subsequence, maximal consecutive longest common subsequence, mutual information and edit distances, are also evaluated and compared. In the first, synthetic set of experiments, the COM and RLM features outperform the other state-of-the-art statistical features. In three out of four cases, the RLM and COM features performed significantly better than the second-best group, the distance-based measures (p-value < 0.001). On a real text plagiarism dataset, the RLM features obtained the best results.
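
As a rough illustration of the idea, and not the authors' exact formulation, co-occurrence and run-length statistics can be collected directly from the characters of a string, without any language-specific processing. The sketch below makes several assumptions that the abstract does not state: the function names `cooccurrence_counts`, `run_length_counts` and `cosine_similarity` are hypothetical, the `offset` parameter is an illustrative choice, and cosine similarity is swapped in as a simple way to compare two count vectors. The features actually derived from the matrices in the paper may differ.

```python
from collections import Counter
from itertools import groupby


def cooccurrence_counts(text: str, offset: int = 1) -> Counter:
    """Count ordered character pairs (a, b) where b occurs `offset` positions after a.

    This is a sparse stand-in for a co-occurrence matrix (assumed form).
    """
    if offset < 1:
        raise ValueError("offset must be a positive integer")
    return Counter(zip(text, text[offset:]))


def run_length_counts(text: str) -> Counter:
    """Count runs as (character, run length) pairs.

    This is a sparse stand-in for a run-length matrix (assumed form).
    """
    return Counter((ch, len(list(group))) for ch, group in groupby(text))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity of two sparse count vectors; 0.0 if either is empty."""
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm_a = sum(value * value for value in a.values()) ** 0.5
    norm_b = sum(value * value for value in b.values()) ** 0.5
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    first, second = "string similarity", "string similarities"
    com_score = cosine_similarity(cooccurrence_counts(first), cooccurrence_counts(second))
    rlm_score = cosine_similarity(run_length_counts(first), run_length_counts(second))
    print(f"co-occurrence similarity: {com_score:.3f}")
    print(f"run-length similarity:    {rlm_score:.3f}")
```

Because both functions operate on raw characters, the same code applies unchanged to words, phrases, codes or longer texts. For long texts, counting over word tokens instead of characters would be a natural variation, though that too is an assumption rather than something the abstract specifies.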

Keywords: word comparison; string similarity; classification; statistical features; text mining; optical character recognition; OCR; text plagiarism; text entailment; supervised learning.

DOI: 10.1504/IJDMMM.2020.108731

International Journal of Data Mining, Modelling and Management, 2020, Vol. 12, No. 3, pp. 277-307

Received: 24 Jul 2019
Accepted: 18 Mar 2020

Published online: 23 Jul 2020
