Inderscience Publishers
  PUBLISHERS OF DISTINGUISHED ACADEMIC, SCIENTIFIC AND PROFESSIONAL JOURNALS

Article Abstract

Title: Searching for web information more efficiently using presentational layout analysis
  Author: Milos Kovacevic, Michelangelo Diligenti, Marco Gori, Veljko Milutinovic
  Address: Faculty of Civil Engineering, University of Belgrade, Serbia, Yugoslavia. Dipartimento di Ingegneria dell'Informazione, University of Siena, Italy. Dipartimento di Ingegneria dell'Informazione, University of Siena, Italy. Faculty of Electrical Engineering, University of Belgrade, Serbia, Yugoslavia
  Journal: International Journal of Electronic Business 2003 - Vol. 1, No. 3, pp. 310-326
  Abstract: Extracting and processing information from web pages is an important task in many areas, such as constructing search engines, information retrieval, and data mining from the web. A common approach in the extraction process is to represent a page as a bag of words and then to perform additional processing on that flat representation. In this paper, we propose a new, hierarchical representation that includes browser screen coordinates for every HTML object on a page. Using this visual information, one can define heuristics for recognising common page areas such as the header, left and right menus, footer and centre of a page. Initial experiments have shown that, using our heuristics, the defined areas are recognised properly in 73% of cases. Finally, we introduce a classification system which, taking into account the proposed document layout analysis, clearly outperforms standard systems by 10% or more.
  Keywords: web page classification; focused search engines; page layout; common areas; recognition heuristics; Naive Bayes classifier.
  DOI: 10.1504/IJEB.2003.002180
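
  Illustrative sketch (not from the paper): the abstract describes heuristics that use browser screen coordinates to assign HTML objects to common page areas. The Python fragment below shows one way such a rule could look. The Box type, the classify_area function and the band and side thresholds are all hypothetical names and values chosen for illustration; the heuristics actually used by the authors are not given in the abstract and may differ substantially.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """Rendered bounding box of one HTML object, in screen pixels (assumed layout)."""
    x: float       # left edge
    y: float       # top edge
    width: float
    height: float


def classify_area(box, page_width, page_height, band=0.2, side=0.3):
    """Assign a box to one of five common page areas.

    band and side are illustrative fractions of the page height and width;
    they are assumptions, not values taken from the paper.
    """
    top, bottom = box.y, box.y + box.height
    left, right = box.x, box.x + box.width

    # Objects lying entirely in the top or bottom band of the page.
    if bottom <= band * page_height:
        return "header"
    if top >= (1 - band) * page_height:
        return "footer"
    # Objects lying entirely in the left or right strip of the page.
    if right <= side * page_width:
        return "left_menu"
    if left >= (1 - side) * page_width:
        return "right_menu"
    # Everything else is treated as the centre of the page.
    return "centre"


if __name__ == "__main__":
    # A narrow column on the left of a 1000 x 2000 pixel page.
    print(classify_area(Box(0, 500, 200, 800), 1000, 2000))
```

  A classifier that takes layout into account could then, for example, weight or separate the words found in each area instead of pooling the whole page into a single bag of words.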