Title: Reengineering PDF-based documents targeting complex software specifications
Authors: Mehrdad Nojoumian; Timothy C. Lethbridge
Addresses: David R. Cheriton School of Computer Science, University of Waterloo, Waterloo, Ontario, N2L 3G1, Canada; School of Electrical Engineering and Computer Science, 800 King Edward Ave., Ottawa, ON K1N 6N5, Canada
Abstract: We discuss how to reengineer complex PDF-based documents, such as specifications and technical books, so that end users have a better experience with them. Specifications of the Object Management Group (OMG) are our initial targets. Such specifications are dense, difficult to use, and tend to have complicated structures. Our approach includes format conversion, logical structure extraction, text extraction, and multi-layer hypertext generation. Logical structure extraction is central and results in an XML document with a schema tailored to the type of document. Many key concepts of a document are expressed in this schema, including concepts extracted from the patterns of words used in headings. For example, in OMG specifications, package relationships and class associations can often be extracted from the wording of headings. When we produce the multi-layer hypertext version of the document in the final step, these extracted concepts allow a richer user experience.
Keywords: digital libraries; electronic publishing; improving user experiences; browsing interfaces; e-publishing; document reengineering; PDF-based documents; software specifications; object management group; OMG specifications; XML documents; format conversion; logical structure extraction; text extraction; multi-layer hypertext generation; package relationships; class associations; document headings; word patterns.
DOI: 10.1504/IJKWI.2011.045165
International Journal of Knowledge and Web Intelligence, 2011 Vol.2 No.4, pp.292 - 319
Published online: 07 Mar 2015