Title: Writing type, script and language identification in heterogeneous documents
Authors: Anis Mezghani; Fouad Slimane; Monji Kherallah
Addresses: Research Groups on Intelligent Machines Laboratory, University of Sfax, Sfax 3038, Tunisia; MEDIA Research Laboratory, Swiss Federal Institute of Technology in Lausanne (EPFL), Lausanne 1015, Switzerland; Research Groups on Intelligent Machines Laboratory, University of Sfax, Sfax 3038, Tunisia
Abstract: In this paper, we propose a method for writing type, script and language identification that automatically classifies texts segmented from heterogeneous document images. These documents are written in Arabic, French and English and mix machine-printed and handwritten text. To handle this problem, we process each text-line or word image with a fixed-length sliding window. Each window is represented by 23 simple and efficient features, which Gaussian mixture models (GMMs) use to identify the writing type and the script. Language identification is based on a bi-gram analysis of the output of an optical character recognition (OCR) system. Experiments were conducted on handwritten and machine-printed text-blocks, text-lines and words extracted from the Maurdor database. The results demonstrate the feasibility of the proposed method for writing type, script and language identification.
Keywords: heterogeneous documents; writing type identification; script and language identification; features; GMMs; Gaussian mixture models; recognition; sliding window; bi-gram analysis.
DOI: 10.1504/IJISTA.2017.085358
International Journal of Intelligent Systems Technologies and Applications, 2017 Vol.16 No.3, pp.225 - 245
Received: 28 Jul 2016
Accepted: 05 Jan 2017
Published online: 24 Jul 2017
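
Illustrative sketch (not taken from the paper): the abstract states that language identification relies on a bi-gram analysis of OCR output, but it does not say whether the bi-grams are over characters or words, how they are scored, or which OCR system is used. The Python example below is one possible reading, assuming character bi-grams, add-one smoothing and OCR output that is already available as a plain string. All function names and the tiny French/English training sentences are hypothetical and serve only to make the example runnable.

```python
import math
from collections import Counter


def char_bigrams(text):
    """Return the character bi-grams of a lowercased, whitespace-normalised string."""
    cleaned = " ".join(text.lower().split())
    return [cleaned[i:i + 2] for i in range(len(cleaned) - 1)]


def train_profile(samples):
    """Count character bi-grams over a list of training strings for one language."""
    counts = Counter()
    for sample in samples:
        counts.update(char_bigrams(sample))
    return counts


def log_likelihood(text, counts, vocab_size):
    """Add-one smoothed log-likelihood of the text's bi-grams under one profile."""
    total = sum(counts.values())
    return sum(
        math.log((counts[bg] + 1) / (total + vocab_size))
        for bg in char_bigrams(text)
    )


def identify_language(text, profiles):
    """Return the language whose bi-gram profile gives the text the highest score."""
    vocab = set()
    for counts in profiles.values():
        vocab.update(counts)
    vocab_size = len(vocab) + 1  # one extra slot for unseen bi-grams
    return max(
        profiles,
        key=lambda lang: log_likelihood(text, profiles[lang], vocab_size),
    )


# Hypothetical toy training data; a real system would use a large corpus
# per language and would have to tolerate OCR errors.
PROFILES = {
    "english": train_profile([
        "the quick brown fox jumps over the lazy dog",
        "this is the report that was sent to the office",
        "we thank you for your letter and the information",
    ]),
    "french": train_profile([
        "le renard brun saute par dessus le chien paresseux",
        "ceci est le rapport qui arrive au bureau",
        "nous vous remercions pour votre lettre et les informations",
    ]),
}

if __name__ == "__main__":
    for ocr_line in ["thank you for the report",
                     "nous vous remercions pour le rapport"]:
        print(ocr_line, "->", identify_language(ocr_line, PROFILES))
```

In a pipeline like the one the abstract describes, script identification would presumably run first, so a model of this kind would mainly need to separate languages that share a script, such as French and English.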