Authors: Giovanni Siragusa; Luigi Di Caro; Marco Tosalli
Addresses: Department of Computer Science, University of Turin, Turin, Italy; Department of Computer Science, University of Turin, Turin, Italy; Nuance Communication Inc., Strada del Lionetto, 6, Turin, Italy
Abstract: Many natural language understanding tasks require clean textual input data in order to train systems with the highest precision. Such data, usually collected from surveys or the web, are manually processed to remove morphosyntactic variability, spelling errors and inconsistencies in named entities. Because these operations are carried out by domain experts and annotators, they are usually costly and time-consuming. This scenario is especially common in industrial tasks, where annotators are hired for the purpose. In this context, we propose an innovative and simple method that extracts correction patterns, i.e., <expression, replacement> pairs, where expression is a matching string and replacement indicates how to rewrite the matched string. Such a tool can be used both to evaluate annotators (since it gives insight into their work) and to revise texts automatically. We tested our method extensively in a multilingual setting, obtaining results that outperform baseline approaches.
Keywords: pattern extraction; natural language understanding; annotation learning; correction patterns.
International Journal of Metadata, Semantics and Ontologies, 2019 Vol.13 No.3, pp.254 - 263
Received: 22 Mar 2018
Accepted: 01 Mar 2019
Published online: 23 May 2019
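
To make the notion of a correction pattern from the abstract more concrete, the following is a minimal sketch of how a list of <expression, replacement> pairs could be applied to revise a text. It is not the authors' extraction method, which the abstract does not detail; it only illustrates the application step. The function name `apply_patterns`, the example patterns, and the choice to treat each expression as a case-insensitive regular expression are all assumptions made for illustration.

```python
import re
from typing import List, Tuple

# A correction pattern is an (expression, replacement) pair.
# Assumption: the expression is interpreted as a regular expression;
# the source only says it is "a matching string".
CorrectionPattern = Tuple[str, str]


def apply_patterns(text: str, patterns: List[CorrectionPattern]) -> str:
    """Apply each pattern in order, rewriting every match of its expression."""
    for expression, replacement in patterns:
        text = re.sub(expression, replacement, text, flags=re.IGNORECASE)
    return text


# Hypothetical patterns covering the three kinds of noise named in the abstract.
patterns: List[CorrectionPattern] = [
    (r"\bteh\b", "the"),                # spelling error
    (r"\bnew\s*york\b", "New York"),    # inconsistent entity naming
    (r"\bcolou?rs?\b", "color"),        # morphosyntactic variability
]

revised = apply_patterns("I saw teh colours of newyork", patterns)
print(revised)
```

Patterns are applied sequentially, so their order matters whenever the output of one pattern can match the expression of a later one.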