Title: Automatic metadata extraction via image processing using Migne's Patrologia Graeca

Authors: Evagelos Varthis; Marios Poulos; Ilias Giarenis; Sozon Papavlasopoulos

Addresses: Department of Archives, Library Science and Museology, Ionian University, Corfu, Greece; Department of Archives, Library Science and Museology, Ionian University, Corfu, Greece; Department of History, Ionian University, Corfu, Greece; Department of Archives, Library Science and Museology, Ionian University, Corfu, Greece

Abstract: A wealth of knowledge is kept in libraries and cultural institutions in various digital forms that do not, however, allow even a simple term search, let alone a substantial semantic search. This study proposes a novel approach that aims to recognise words and automatically generate metadata from large machine-printed corpora such as Migne's Patrologia Graeca (PG). The proposed framework first applies an efficient word segmentation and then transforms the word-images into special compact shapes. For the comparison, we use Hu's invariant moments to discard unlikely matches, Shape Context (SC) for contour similarity and Pearson's Correlation Coefficient (PCC) for final verification. Comparative results are also presented in which the Long Short-Term Memory (LSTM) Neural Network (NN) engine of the Tesseract Optical Character Recognition (OCR) system is used instead of PCC. In addition, an intelligent scenario is proposed for the automatic generation of PG metadata by librarians.
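
The matching cascade described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the Shape Context stage is omitted, only the first two Hu invariants are used, and the column-sum profile, the function names and the thresholds `hu_tol` and `pcc_min` are assumptions introduced purely for illustration. Word-images are assumed to be binary and already scaled to the same width.

```python
import math

def pearson(x, y):
    """Pearson's Correlation Coefficient of two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("sequences must have equal length >= 2")
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0  # constant profile: correlation undefined, treated as no match
    return cov / math.sqrt(vx * vy)

def hu_first_two(img):
    """First two Hu invariant moments of a binary image (list of rows of 0/1)."""
    m00 = sum(sum(row) for row in img)
    if m00 == 0:
        raise ValueError("empty image")
    cx = sum(x * v for row in img for x, v in enumerate(row)) / m00
    cy = sum(y * v for y, row in enumerate(img) for v in row) / m00

    def eta(p, q):
        # Normalised central moment: mu_pq / m00^(1 + (p + q) / 2)
        mu = sum(((x - cx) ** p) * ((y - cy) ** q) * v
                 for y, row in enumerate(img) for x, v in enumerate(row))
        return mu / m00 ** (1 + (p + q) / 2)

    n20, n02, n11 = eta(2, 0), eta(0, 2), eta(1, 1)
    return (n20 + n02, (n20 - n02) ** 2 + 4 * n11 ** 2)

def column_profile(img):
    """Hypothetical 1-D signature: number of ink pixels in each column."""
    return [sum(row[x] for row in img) for x in range(len(img[0]))]

def is_match(query, candidate, hu_tol=0.05, pcc_min=0.9):
    """Cheap Hu-moment filter first, then PCC verification on the profiles."""
    hq, hc = hu_first_two(query), hu_first_two(candidate)
    if any(abs(a - b) > hu_tol for a, b in zip(hq, hc)):
        return False  # discarded early as an unlikely match
    pq, pc = column_profile(query), column_profile(candidate)
    if len(pq) != len(pc):
        return False  # this sketch assumes width-normalised word-images
    return pearson(pq, pc) >= pcc_min
```

The ordering reflects the idea stated in the abstract: an inexpensive global descriptor rejects most candidates, so the costlier comparison runs only on plausible pairs.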

Keywords: Patrologia Graeca; word spotting; shape context; time series; metadata extraction; semantic enrichment; digital librarian.

DOI: 10.1504/IJMSO.2020.115434

International Journal of Metadata, Semantics and Ontologies, 2020 Vol.14 No.4, pp.265 - 278

Received: 30 May 2020
Accepted: 05 Oct 2020

Published online: 02 Jun 2021
