Title: Understanding web documents: finding pagelets for transformation using structural patterns

Authors: Reza Ferrydiansyah, Bambang Parmanto

Addresses: Health Information Management, School of Health and Rehabilitation Sciences, University of Pittsburgh, 6025 Forbes Tower, HIM, Pittsburgh, PA 15260, USA; Health Information Management, School of Health and Rehabilitation Sciences, University of Pittsburgh, 6026 Forbes Tower, HIM, Pittsburgh, PA 15260, USA

Abstract: Understanding a web document and the sections inside it is important for web transformation and for information retrieval from web pages. Detecting pagelets, which are small features located inside a web page, in order to understand a web document's structure is a difficult problem. Current work on pagelet detection focuses only on finding the location of the pagelet without regard to its functionality. We describe a method to detect both the location and functionality of pagelets using HTML element patterns. For each pagelet type, an HTML element pattern is created and matched against a web page. Sections of the web page that match the pattern are marked as pagelet candidates. We test this technique on multiple popular web pages from the news and e-commerce genres. We find that this method adequately recalls various pagelets from the web page.
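
The matching step described in the abstract can be illustrated with a minimal sketch. The abstract does not specify how the patterns are written, so everything here is an assumption made for illustration: the pattern format (container tag, child tag, required descendant, minimum count), the two example pagelet types, and the names `PATTERNS`, `matches`, and `find_pagelets` are hypothetical and are not the authors' implementation. The sketch uses only Python's standard `html.parser` module.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag in HTML.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class Node:
    """A bare element node: tag name, parent, and child elements."""
    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []


class TreeBuilder(HTMLParser):
    """Builds a simple element tree, ignoring text and attributes."""
    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag, if there is one.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent


def has_descendant(node, tag):
    return any(c.tag == tag or has_descendant(c, tag) for c in node.children)


# Hypothetical patterns, one per pagelet type:
# (container tag, child tag, tag required inside each child or None, min count)
PATTERNS = {
    "navigation": ("ul", "li", "a", 3),
    "search": ("form", "input", None, 1),
}


def matches(node, pattern):
    container, child_tag, inner, min_count = pattern
    if node.tag != container:
        return False
    hits = [c for c in node.children
            if c.tag == child_tag and (inner is None or has_descendant(c, inner))]
    return len(hits) >= min_count


def find_pagelets(html):
    """Return (pagelet type, matched tag) candidates in document order."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    found = []

    def walk(node):
        for name, pattern in PATTERNS.items():
            if matches(node, pattern):
                found.append((name, node.tag))
        for child in node.children:
            walk(child)

    walk(builder.root)
    return found
```

In this sketch each match yields both a location (the matched element) and a functional label (the pagelet type whose pattern matched), which mirrors the distinction the abstract draws; a realistic pattern set would need to be considerably richer than the two shown.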

Keywords: pagelet detection; segmentation; pattern matching; annotation; transcoding; world wide web; HTML element patterns; web documents; web transformation; information retrieval; document structure.

DOI: 10.1504/IJWET.2008.019537

International Journal of Web Engineering and Technology, 2008 Vol.4 No.3, pp.313 - 335

Published online: 15 Jul 2008
