Title: Statistical comparison of classifiers for script identification from multi-script handwritten documents

Authors: Pawan Kumar Singh; Ram Sarkar; Nibaran Das; Subhadip Basu; Mita Nasipuri

Addresses: Department of Computer Science and Engineering, Jadavpur University, 188, Raja S.C. Mullick Road, Kolkata-700032, West Bengal, India (all authors)

Abstract: Script identification in handwritten document images is an open document analysis problem, especially for multilingual optical character recognition (OCR) systems. To design an OCR system for multi-script document pages, it is essential to recognise the different scripts before running the OCR system specific to each script. The present work reports an intelligent feature-based technique for word-level script identification in multi-script handwritten document pages. First the text lines, and then the words, are extracted from the document pages. A set of 39 distinctive features has been designed for each word image, of which eight are topological and the remaining 31 are based on the convex hull. To select a suitable classifier, the performances of multiple classifiers are evaluated with the designed feature set on multiple subsets of the freely available database CMATERdb1.5.1 (http://www.code.google.com/p/cmaterdb), which comprises 150 handwritten document pages containing both Devnagari and Roman script words. Statistical significance tests on these performance measures indicate that MLP is the best-performing classifier. The overall word-level script identification accuracy with the MLP classifier on this database is 99.74%.
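
The abstract does not name the statistical significance tests that were applied. As a minimal sketch of how several classifiers can be compared across multiple dataset subsets, the following assumes a Friedman rank test, a common choice for this kind of comparison and not necessarily the one used in the paper. The classifier names (`clf_a`, `clf_b`, `clf_c`) and the accuracy values are hypothetical placeholders, not results from the paper.

```python
from typing import Dict, List


def average_ranks(scores: Dict[str, List[float]]) -> Dict[str, float]:
    """Rank classifiers on each subset (1 = highest accuracy) and average the ranks."""
    names = list(scores)
    n_subsets = len(next(iter(scores.values())))
    totals = {name: 0.0 for name in names}
    for i in range(n_subsets):
        # Order classifiers by accuracy on subset i, best first.
        ordered = sorted(names, key=lambda n: scores[n][i], reverse=True)
        pos = 0
        while pos < len(ordered):
            # Find the run of classifiers tied with the one at `pos`.
            end = pos
            while (end + 1 < len(ordered)
                   and scores[ordered[end + 1]][i] == scores[ordered[pos]][i]):
                end += 1
            # Tied classifiers share the mean of the ranks they occupy.
            tied_rank = ((pos + 1) + (end + 1)) / 2
            for name in ordered[pos:end + 1]:
                totals[name] += tied_rank
            pos = end + 1
    return {name: totals[name] / n_subsets for name in names}


def friedman_statistic(scores: Dict[str, List[float]]) -> float:
    """Friedman chi-square statistic (no tie correction) for k classifiers on n subsets."""
    k = len(scores)
    n = len(next(iter(scores.values())))
    ranks = average_ranks(scores)
    sum_sq = sum(r * r for r in ranks.values())
    return 12 * n / (k * (k + 1)) * (sum_sq - k * (k + 1) ** 2 / 4)


# Hypothetical accuracies of three classifiers on four subsets (illustration only).
example = {
    "clf_a": [0.99, 0.98, 0.99, 0.97],
    "clf_b": [0.97, 0.97, 0.98, 0.96],
    "clf_c": [0.95, 0.96, 0.95, 0.94],
}

print(average_ranks(example))
print(friedman_statistic(example))
```

The statistic is compared against a chi-square distribution with k - 1 degrees of freedom. If the null hypothesis of equal performance is rejected, a post-hoc test on the average ranks would typically follow to identify which classifiers differ.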

Keywords: script identification; multilingual handwritten pages; optical character recognition; multilingual OCR; statistical significance tests; convex hull features; handwritten documents; handwritten document images; classifiers; classifier evaluation.

DOI: 10.1504/IJAPR.2014.063741

International Journal of Applied Pattern Recognition, 2014 Vol.1 No.2, pp.152 - 172

Received: 22 May 2013
Accepted: 30 Aug 2013

Published online: 20 Jul 2014
