Authors: João Antonio Silva; Denilson Alves Pereira
Addresses: Department of Computer Science, Universidade Federal de Lavras, P.O. Box 3037, 37.200-000, Lavras, Brazil; Department of Computer Science, Universidade Federal de Lavras, P.O. Box 3037, 37.200-000, Lavras, Brazil
Abstract: Several web applications maintain data repositories containing references to thousands of real-world entities originating from multiple sources, and they continually receive new data. Identifying the distinct entities and associating the correct references with each one is a problem known as entity resolution. The challenge is to solve the problem incrementally, as the data arrive, especially when those data are described by a single textual attribute. In this paper, we propose a new approach for incremental entity resolution. The method we have implemented, called AssocIER, uses an ensemble of multiclass classifiers with self-training and detection of novel classes. We have evaluated our method on various real-world datasets and scenarios, comparing it with a traditional entity resolution approach. The results show that AssocIER is effective and efficient at handling unstructured data in collections with a large number of entities and features, and that it is able to detect hundreds of novel classes.
Keywords: entity resolution; associative classification; incremental learning; novel class detection; self-training.
International Journal of Business Intelligence and Data Mining, 2021 Vol.18 No.2, pp.218-245
Received: 21 Mar 2018
Accepted: 28 Jun 2018
Published online: 15 Feb 2021