
Searching for web information more efficiently using presentational layout analysis
by Milos Kovacevic, Michelangelo Diligenti, Marco Gori, Veljko Milutinovic
International Journal of Electronic Business (IJEB), Vol. 1, No. 3, 2003
Abstract: Extracting and processing information from web pages is an important task in many areas, such as constructing search engines, information retrieval, and data mining from the web. A common approach in the extraction process is to represent a page as a bag of words and then to perform additional processing on this flat representation. In this paper, we propose a new, hierarchical representation that includes browser screen coordinates for every HTML object on a page. Using this visual information, one can define heuristics for recognising common page areas such as the header, left and right menus, footer and centre of a page. Initial experiments have shown that, using our heuristics, the defined areas are recognised properly in 73% of cases. Finally, we introduce a classification system which, taking into account the proposed document layout analysis, clearly outperforms standard systems by 10% or more.
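
The abstract does not give the heuristics themselves, so the following is only a minimal sketch of how screen coordinates could be used to assign an HTML object to a page area. The `Box` type, the `classify_area` function and the `band` and `side` thresholds are all hypothetical and chosen for illustration; they are not the values or rules used in the paper.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """Rendered bounding box of one HTML object, in pixels (hypothetical type)."""
    x: float       # left edge
    y: float       # top edge
    width: float
    height: float


def classify_area(box, page_width, page_height, band=0.2, side=0.3):
    """Assign a box to a page area using illustrative position thresholds.

    band: assumed fraction of the page height treated as header or footer.
    side: assumed fraction of the page width treated as a left or right menu.
    """
    top, bottom = box.y, box.y + box.height
    left, right = box.x, box.x + box.width

    # Header or footer: the object lies entirely inside the top or bottom band.
    if bottom <= band * page_height:
        return "header"
    if top >= (1 - band) * page_height:
        return "footer"

    # Side menus: the object lies entirely inside the left or right strip.
    if right <= side * page_width:
        return "left menu"
    if left >= (1 - side) * page_width:
        return "right menu"

    # Everything else is treated as the centre of the page.
    return "centre"
```

For example, on a page rendered at 1000 by 2000 pixels, `classify_area(Box(0, 0, 1000, 100), 1000, 2000)` falls in the top band and is labelled as a header under these assumed thresholds. A flat bag-of-words representation has no access to such positional cues, which is the contrast the abstract draws.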