Title: Categorisation of web documents using extraction ontologies

Authors: Li Xu, David W. Embley

Addresses: Department of Computer Science, University of Arizona South, 1140 N Colombo Ave., Sierra Vista, AZ 85635, USA; Department of Computer Science, Brigham Young University, Provo, Utah, USA

Abstract: Automatically recognising which HTML documents on the Web contain items of interest for a user is non-trivial. As a step toward solving this problem, we propose an approach based on information-extraction ontologies. Given HTML documents, tables, and forms, our document recognition system extracts expected ontological vocabulary (keywords and keyword phrases) and expected ontological instance data (particular values for ontological concepts). We then use machine-learned rules over this extracted information to determine whether an HTML document contains items of interest. Experimental results show that our ontological approach to categorisation works well, having achieved F-measures above 90% for all applications we tried.
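The two-stage idea in the abstract (extract ontological vocabulary and instance data, then apply rules over what was extracted) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the toy car-advertisement ontology, the regular expressions, the names `ONTOLOGY`, `extract_features` and `is_relevant`, and the thresholds. In particular, the hand-written rule only stands in for the machine-learned rules the authors describe; it is not their method.

```python
import re

# Hypothetical toy "extraction ontology" for car advertisements.
# The patterns are illustrative assumptions, not taken from the paper.
ONTOLOGY = {
    # Expected vocabulary: keywords and keyword phrases.
    "keywords": [r"\bmileage\b", r"\bfor sale\b", r"\basking price\b"],
    # Expected instance data: value patterns for ontological concepts.
    "values": {
        "Price": r"\$\d{1,3}(?:,\d{3})*",
        "Year": r"\b(?:19|20)\d{2}\b",
    },
}


def extract_features(text):
    """Count keyword matches and instance-value matches in plain text."""
    lowered = text.lower()
    keyword_hits = sum(
        len(re.findall(pattern, lowered)) for pattern in ONTOLOGY["keywords"]
    )
    value_hits = {
        concept: len(re.findall(pattern, text))
        for concept, pattern in ONTOLOGY["values"].items()
    }
    return {"keyword_hits": keyword_hits, **value_hits}


def is_relevant(features):
    """Hand-written stand-in for a machine-learned rule (thresholds assumed)."""
    return (
        features["keyword_hits"] >= 2
        and features["Price"] >= 1
        and features["Year"] >= 1
    )


ad = "2004 Honda Civic for sale. Mileage 60,000. Asking price $5,500."
other = "Our company history began in 1999 with a small team."
print(is_relevant(extract_features(ad)))
print(is_relevant(extract_features(other)))
```

In a fuller pipeline, the feature counts would be computed from the text of HTML documents, tables, and forms, and the decision rule would be learned from labelled examples rather than written by hand.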

Keywords: document categorisation; web documents; document classification; extraction ontologies; HTML documents; information extraction; machine learning; internet; information retrieval.

DOI: 10.1504/IJMSO.2008.021202

International Journal of Metadata, Semantics and Ontologies, 2008 Vol.3 No.1, pp.3 - 20

Published online: 10 Nov 2008
