Authors: Milos Kovacevic, Michelangelo Diligenti, Marco Gori, Veljko Milutinovic
Addresses: Faculty of Civil Engineering, University of Belgrade, Serbia, Yugoslavia. Dipartimento di Ingegneria dell'Informazione, University of Siena, Italy. Dipartimento di Ingegneria dell'Informazione, University of Siena, Italy. Faculty of Electrical Engineering, University of Belgrade, Serbia, Yugoslavia
Abstract: Extracting and processing information from web pages is an important task in many areas, such as constructing search engines, information retrieval, and data mining from the web. A common approach in the extraction process is to represent a page as a bag of words and then to perform additional processing on that flat representation. In this paper, we propose a new, hierarchical representation that includes browser screen coordinates for every HTML object on a page. Using this visual information, one can define heuristics for recognising common page areas such as the header, left and right menus, footer and centre of a page. Initial experiments have shown that, using our heuristics, the defined areas are recognised correctly in 73% of cases. Finally, we introduce a classification system which, taking into account the proposed document layout analysis, clearly outperforms standard systems by 10% or more.
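The idea of recognising page areas from screen coordinates can be illustrated with a minimal sketch. It is not the heuristic set used in the paper: the `Box` type, the `classify_area` function, and the `band` and `side` thresholds are hypothetical, chosen only to show how a rendered position might map to one of the five areas named in the abstract.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """Rendered bounding box of an HTML object, in browser screen pixels."""
    x: float
    y: float
    width: float
    height: float


def classify_area(box, page_width, page_height, band=0.2, side=0.3):
    """Label a box by where its centre falls on the rendered page.

    `band` and `side` are assumed fractions of the page height and width;
    they are illustrative values, not thresholds taken from the paper.
    """
    # Normalise the centre of the box to the range [0, 1] on both axes.
    cx = (box.x + box.width / 2) / page_width
    cy = (box.y + box.height / 2) / page_height

    # Vertical bands are checked first, then the side columns.
    if cy < band:
        return "header"
    if cy > 1 - band:
        return "footer"
    if cx < side:
        return "left menu"
    if cx > 1 - side:
        return "right menu"
    return "centre"
```

For example, on a page rendered at 1000 by 2000 pixels, `classify_area(Box(0, 300, 200, 800), 1000, 2000)` places a narrow column near the left edge in the left menu. A real system would need further rules, for instance for objects that span several areas.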
Keywords: web page classification; focused search engines; page layout; common areas; recognition heuristics; Naive Bayes classifier.
International Journal of Electronic Business, 2003 Vol.1 No.3, pp.310-326
Published online: 23 Jul 2003