Title: Accelerating the process of web page segmentation via template clustering

Authors: Jan Zeleny; Radek Burget

Addresses: Faculty of Information Technology, Brno University of Technology, Brno, Czech Republic; Faculty of Information Technology, Brno University of Technology, IT4Innovations Centre of Excellence, Brno, Czech Republic

Abstract: Page segmentation is often one of the initial steps when performing data mining on a web page. In recent years, several page segmentation methods based on the visual perception of the web page have been developed. In this paper, we propose a generic method for improving the efficiency of virtually all vision-based segmentation algorithms. Our method, called cluster-based page segmentation, exploits the widespread concept of web templates: it clusters web pages and performs the segmentation once per cluster instead of once for each page in the cluster. To demonstrate the efficiency of our algorithm, we offer experimental results gathered using three different vision-based segmentation algorithms.
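
The idea summarised in the abstract, segmenting once per template cluster rather than once per page, can be illustrated with a minimal sketch. The abstract does not say how the authors detect templates or form clusters, so everything below is an assumption made for illustration only: the structural signature (a set of DOM tag paths), the Jaccard similarity, the greedy clustering, the `threshold` value, and the function names `cluster_pages` and `segment_by_cluster` are all hypothetical and are not taken from the paper.

```python
from typing import Callable, Dict, FrozenSet, List, Tuple

# Hypothetical structural signature of a page: the set of its DOM tag paths.
Signature = FrozenSet[str]


def jaccard(a: Signature, b: Signature) -> float:
    """Jaccard similarity of two signatures (1.0 when both are empty)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def cluster_pages(pages: Dict[str, Signature],
                  threshold: float = 0.8) -> List[List[str]]:
    """Greedy clustering: a page joins the first cluster whose
    representative signature is at least `threshold` similar to it."""
    clusters: List[Tuple[Signature, List[str]]] = []
    for url, sig in pages.items():
        for rep_sig, members in clusters:
            if jaccard(sig, rep_sig) >= threshold:
                members.append(url)
                break
        else:
            # No sufficiently similar cluster found: start a new one.
            clusters.append((sig, [url]))
    return [members for _, members in clusters]


def segment_by_cluster(pages: Dict[str, Signature],
                       segment: Callable[[str], object],
                       threshold: float = 0.8) -> Dict[str, object]:
    """Run the (expensive) segmentation once per cluster and reuse the
    result for every page assumed to share that cluster's template."""
    results: Dict[str, object] = {}
    for members in cluster_pages(pages, threshold):
        layout = segment(members[0])  # one call per cluster, not per page
        for url in members:
            results[url] = layout
    return results
```

In this sketch, `segment` stands in for any vision-based segmentation routine; the saving comes from calling it once per cluster, which only pays off when pages in a cluster really do share a layout.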

Keywords: VIPS; page segmentation; vision-based page segmentation; web page segmentation; web page preprocessing; segmentation performance; template detection; template clustering; data mining; visual perception; web templates.

DOI: 10.1504/IJIIDS.2016.075424

International Journal of Intelligent Information and Database Systems, 2016 Vol.9 No.2, pp.134 - 154

Available online: 22 Mar 2016
