Title: Intelligent typo correction for text mining through machine learning

Authors: Yinghao Huang; Yi Lu Murphey; Yao Ge

Addresses: CommVault Systems, Inc., 1 CommVault Way, Tinton Falls, NJ 07724, USA; Department of Electrical and Computer Engineering, University of Michigan-Dearborn, 4901 Evergreen Rd, Dearborn, MI 48128, USA; Ford Motor Company, 2101 Village Rd, Dearborn, MI 48124, USA

Abstract: Typo detection and correction is an important process in many text mining applications. This research focuses on automatic typo detection and correction for text documents that are unstructured, contain many grammar and spelling errors, and use many self-invented terms that can be interpreted only through domain-specific knowledge. In this paper we present an intelligent typo detection and correction (ITDC) system. Its 'intelligence' lies in automatically identifying and accurately correcting a broad range of typos, from simple ones such as character duplication, omission, transposition and substitution, to complex spelling errors such as word boundary errors and unconventional use of acronyms. ITDC utilises general language knowledge and domain-specific knowledge extracted by machine learning algorithms. It is evaluated through a case study that involves the automatic processing of automotive fault diagnostic text documents. The experimental results show that the proposed system outperforms some state-of-the-art spell checking systems.

Keywords: machine learning; neural networks; spelling errors; text processing; text mining; intelligent correction; typo correction; typos; typographical errors; typo detection; grammatical errors; text documents; self-invented terminology; spell checking; language knowledge; domain-specific knowledge; case study; automotive fault diagnosis; automobile industry.

DOI: 10.1504/IJKEDM.2015.071290

International Journal of Knowledge Engineering and Data Mining, 2015 Vol.3 No.2, pp.115 - 142

Received: 13 Mar 2014
Accepted: 08 Mar 2015

Published online: 19 Aug 2015
