Title: Writing type, script and language identification in heterogeneous documents

Authors: Anis Mezghani; Fouad Slimane; Monji Kherallah

Addresses: Research Groups on Intelligent Machines Laboratory, University of Sfax, Sfax 3038, Tunisia; MEDIA Research Laboratory, Swiss Federal Institute of Technology in Lausanne (EPFL), Lausanne 1015, Switzerland; Research Groups on Intelligent Machines Laboratory, University of Sfax, Sfax 3038, Tunisia

Abstract: In this paper, we propose a method that classifies text by writing type, script and language, in order to automatically identify texts segmented from heterogeneous document images. These documents are written in Arabic, French and English and mix machine-printed and handwritten text. To handle this problem, we process each text-line/word image with a fixed-length sliding window. Each window is represented by 23 simple and efficient features, which Gaussian mixture models (GMMs) use to identify the writing type and the script. Language identification is based on a bi-gram analysis of the output of an optical character recognition (OCR) system. Experiments were conducted on handwritten and machine-printed text-blocks, text-lines and words extracted from the Maurdor database. The results show the feasibility of the proposed method for writing type, script and language identification.
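
To illustrate the language identification step, the following is a minimal sketch of bi-gram scoring applied to OCR output text. It is not the authors' implementation: the abstract does not say whether the bi-grams are over characters or words, so character bi-grams are assumed here, and the add-one smoothing, the function names (`bigrams`, `train`, `score`, `identify`) and the tiny training strings in `SAMPLES` are all hypothetical choices made for illustration.

```python
import math
from collections import Counter

# Toy training text per language (hypothetical; a real system would use
# large corpora, and the input would be the text produced by an OCR engine).
SAMPLES = {
    "english": "the quick brown fox jumps over the lazy dog and then the "
               "children went to the school with their friends this is "
               "what they have been thinking about",
    "french": "le renard brun saute par dessus le chien paresseux et les "
              "enfants sont allés à l'école avec leurs amis c'est ce que "
              "nous avons pensé pour vous",
    "arabic": "ذهب الأطفال إلى المدرسة مع أصدقائهم في الصباح وهذا ما "
              "كانوا يفكرون فيه منذ وقت طويل",
}


def bigrams(text):
    """Return the character bi-grams of the lower-cased, space-padded text."""
    s = " " + " ".join(text.lower().split()) + " "
    return [s[i:i + 2] for i in range(len(s) - 1)]


def train(samples):
    """Build one bi-gram count table per language from {language: text}."""
    return {lang: Counter(bigrams(text)) for lang, text in samples.items()}


def score(text, counts, vocab_size):
    """Log-likelihood of text under add-one smoothed bi-gram frequencies."""
    total = sum(counts.values())
    return sum(
        math.log((counts[bg] + 1) / (total + vocab_size))
        for bg in bigrams(text)
    )


def identify(text, models):
    """Return the language whose bi-gram model scores the text highest."""
    # Shared vocabulary size: all bi-grams seen in training, plus one
    # slot reserved for unseen bi-grams.
    vocab_size = len(set().union(*models.values())) + 1
    return max(models, key=lambda lang: score(text, models[lang], vocab_size))


MODELS = train(SAMPLES)
```

Calling `identify(ocr_text, MODELS)` returns the label of the best-scoring language model. Because OCR output is noisy, a smoothed statistical score of this kind tolerates a few misrecognized characters better than an exact dictionary lookup would, which is one plausible reason to prefer bi-gram statistics in this setting.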

Keywords: heterogeneous documents; writing type identification; script and language identification; features; GMMs; Gaussian mixture models; recognition; sliding window; bi-gram analysis.

DOI: 10.1504/IJISTA.2017.085358

International Journal of Intelligent Systems Technologies and Applications, 2017 Vol.16 No.3, pp.225 - 245

Received: 28 Jul 2016
Accepted: 05 Jan 2017

Published online: 24 Jul 2017
