Title: Detecting duplicate biological entities using Shortest Path Edit Distance

Authors: Alex Rudniy, Min Song, James Geller

Addresses: Department of Computer Science, New Jersey Institute of Technology, Newark, NJ 07102, USA; Department of Information Systems, New Jersey Institute of Technology, Newark, NJ 07102, USA; Department of Computer Science, New Jersey Institute of Technology, Newark, NJ 07102, USA

Abstract: Duplicate entity detection in biological data is an important research task. In this paper, we propose a novel, context-sensitive Shortest Path Edit Distance (SPED), which extends and supplements our previous work on Markov Random Field-based Edit Distance (MRFED). SPED transforms the computation of edit distance into the calculation of the shortest path between two selected vertices of a graph. We produce several modifications of SPED by applying Levenshtein, arithmetic mean, histogram difference and TF-IDF techniques to solve subtasks. We compare the performance of SPED with that of other well-known distance algorithms for biological entity matching. The experimental results show that SPED produces competitive outcomes.
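
The reduction of edit distance to a shortest-path problem can be illustrated with the classic edit graph, in which vertex (i, j) stands for "the first i characters of one string have been turned into the first j characters of the other". The following is a minimal sketch of that general idea only, not the SPED algorithm itself: it assumes uniform unit costs (plain Levenshtein), whereas the abstract describes SPED as context-sensitive and does not specify its edge weights. The function name `edit_distance_via_shortest_path` is hypothetical.

```python
import heapq


def edit_distance_via_shortest_path(a: str, b: str) -> int:
    """Levenshtein distance computed as a shortest path in the edit graph.

    Assumption: every insertion, deletion and substitution costs 1 and a
    match costs 0. SPED's context-sensitive weights are not modelled here.
    """
    target = (len(a), len(b))
    dist = {(0, 0): 0}
    heap = [(0, 0, 0)]  # entries are (cost so far, i, j)

    while heap:
        d, i, j = heapq.heappop(heap)
        if (i, j) == target:
            return d
        if d > dist.get((i, j), float("inf")):
            continue  # stale queue entry, a cheaper path was already found

        neighbours = []
        if i < len(a):
            neighbours.append((i + 1, j, 1))  # delete a[i]
        if j < len(b):
            neighbours.append((i, j + 1, 1))  # insert b[j]
        if i < len(a) and j < len(b):
            # diagonal edge: free on a match, cost 1 on a substitution
            neighbours.append((i + 1, j + 1, 0 if a[i] == b[j] else 1))

        for ni, nj, weight in neighbours:
            nd = d + weight
            if nd < dist.get((ni, nj), float("inf")):
                dist[(ni, nj)] = nd
                heapq.heappush(heap, (nd, ni, nj))

    return dist[target]
```

For example, `edit_distance_via_shortest_path("kitten", "sitting")` returns 3. Replacing the fixed weights with weights that depend on the surrounding characters is one way such a graph formulation could become context-sensitive, although the abstract does not say how SPED does this.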

Keywords: biological entity matching; SPED; shortest path edit distance; histogram matching; duplicate record detection; text mining; Levenshtein; bioinformatics.

DOI: 10.1504/IJDMB.2010.034196

International Journal of Data Mining and Bioinformatics, 2010 Vol.4 No.4, pp.395 - 410

Published online: 17 Jul 2010
