Title: OntoMiner: automated metadata and instance mining from news websites

Authors: Hasan Davulcu, Srinivas Vadrevu, Saravanakumar Nagarajan

Addresses: Department of Computer Science and Engineering, Arizona State University, Tempe, AZ 85287-8809, USA; Department of Computer Science and Engineering, Arizona State University, Tempe, AZ 85287-8809, USA; Convera Corporation, 1808 Aston Avenue, Carlsbad, CA 92008, USA

Abstract: RDF/XML has been widely recognised as the standard for annotating online web documents and for transforming the HTML web into the so-called Semantic Web. To enable widespread usability of the Semantic Web, there is a need to bootstrap large, rich and up-to-date domain ontologies that organise the most relevant concepts, their relationships and instances. In this paper, we present automated techniques for bootstrapping and populating specialised domain ontologies by organising and mining a set of relevant overlapping websites. We develop algorithms that detect and utilise HTML regularities in the web documents to turn them into hierarchical semantic structures encoded as XML. Next, we present tree-mining algorithms that identify key domain concepts and their taxonomical relationships. We also extract semi-structured concept instances, annotated with their labels whenever these are available. Finally, we report an experimental evaluation on the news, travel and shopping domains to demonstrate the efficacy of our algorithms.

Keywords: instance ontology; metadata mining; instance mining; news websites; semantic web; domain ontologies; bootstrapping; web information retrieval; data mining; document information retrieval; web search; travel websites; shopping websites; web documents; taxonomy directed websites.

DOI: 10.1504/IJWGS.2005.008320

International Journal of Web and Grid Services, 2005 Vol.1 No.2, pp.196 - 221

Published online: 02 Dec 2005
