Title: A grammar-based approach for XML schema extraction and heterogeneous document integration

Authors: Prudhvi Janga; Karen C. Davis

Addresses: Amazon, Seattle, Washington, USA; Miami University, Oxford, Ohio, USA

Abstract: The availability of vast amounts of heterogeneous XML web data motivates finding efficient methods to search, integrate, query, and present this data. The structure of XML documents is useful for achieving these tasks; however, not every XML document on the web includes a schema. We discuss challenges and solutions in the area of generation and integration of XML schemas. We propose and implement a framework for efficient schema extraction and integration from heterogeneous XML document collections gathered from the web. Our approach introduces the schema extended context-free grammar (SECFG) to model XML schemas, including detection of attributes, data types, and element occurrences. Unlike other implementations, our approach supports the generation of XML schemas in any XML schema language, e.g., DTD or XSD. We compare our approach with other proposed approaches and conclude that we offer the same or better functionality more efficiently and with greater flexibility. The approach we propose is flexible enough to facilitate integration of, and translation to, tabular (relational) data.

Keywords: XML schema; schema integration; schema extraction; schema discovery.

DOI: 10.1504/IJDMMM.2019.100385

International Journal of Data Mining, Modelling and Management, 2019 Vol.11 No.3, pp.235 - 258

Published online: 28 Jun 2019
