Title: Finding and classifying web units in websites

Authors: Aixin Sun, Ee-Peng Lim

Addresses: School of Computer Engineering, Nanyang Technological University, Nanyang Avenue, 639798, Singapore; School of Computer Engineering, Nanyang Technological University, Nanyang Avenue, 639798, Singapore

Abstract: In web classification, most researchers assume that the objects to be classified are individual web pages from one or more websites. In practice, this assumption is too restrictive, since a single web page may not carry sufficient information to be treated as an instance of a semantic class or concept. In this paper, we relax this assumption and allow a subgraph of web pages to represent an instance of a semantic concept. Such a subgraph of web pages is known as a web unit. To construct and classify web units, we formulate the web unit mining problem and propose an iterative web unit mining (iWUM) method. The iWUM method first finds subgraphs of web pages using knowledge about website structure and the connectivity among web pages. From these web subgraphs, web units are constructed and classified into categories in an iterative manner. Our experiments using the WebKB dataset showed that iWUM was able to construct and classify web units with high accuracy for the more structured parts of a website.

Keywords: web units; web unit mining; web classification; web pages; websites; internet; data mining; world wide web.

DOI: 10.1504/IJBIDM.2005.008361

International Journal of Business Intelligence and Data Mining, 2005 Vol.1 No.2, pp.161 - 193

Published online: 08 Dec 2005
