Title: Exploiting tree structure of a web page for clustering

Authors: Bhaskar Biswas, Karan Jain, Vipul Mittal, K.K. Shukla

Addresses: Department of Computer Engineering, Institute of Technology, Banaras Hindu University, Varanasi 221005, India (all four authors)

Abstract: An approach to designing a Universal Web Wrapper has been under implementation for years. One associated issue is the automated selection of web pages and the subsequent extraction of the content of interest. We propose an algorithm that clusters pages on the basis of their structure. Because the clustered pages are highly similar, it is easier to categorise them and to extract any particular section of a page. The algorithm uses only structural factors, leading to a complexity of O(log n). An evaluation illustrates the precision and efficiency of the algorithm.
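
The abstract does not spell out the clustering algorithm, so the following is only a minimal illustrative sketch of the general idea of comparing pages by tag-tree structure, not the authors' method and not an O(log n) procedure. It assumes a hypothetical structural feature (the number of elements at each depth of the tag tree) and scores a candidate cluster with the error sum of squares named in the keywords. The names `StructureProfile`, `profile`, `error_sum_of_squares` and the `max_depth` parameter are assumptions introduced here.

```python
from collections import Counter
from html.parser import HTMLParser

# HTML elements that never have a closing tag, so they must not deepen the tree.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class StructureProfile(HTMLParser):
    """Counts how many elements open at each depth of the tag tree."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        self.counts[self.depth] += 1
        if tag not in VOID_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.depth > 0:
            self.depth -= 1


def profile(html, max_depth=8):
    """Return a fixed-length vector of element counts for depths 0..max_depth-1.

    This feature is an assumption for illustration; text content is ignored.
    """
    parser = StructureProfile()
    parser.feed(html)
    parser.close()
    return [parser.counts.get(d, 0) for d in range(max_depth)]


def error_sum_of_squares(vectors):
    """Sum of squared distances from each vector to the cluster centroid."""
    n = len(vectors)
    dims = len(vectors[0])
    centroid = [sum(v[i] for v in vectors) / n for i in range(dims)]
    return sum((v[i] - centroid[i]) ** 2 for v in vectors for i in range(dims))


page_a = "<html><body><div><p>a</p><p>b</p></div></body></html>"
page_b = "<html><body><div><p>x</p><p>y</p></div></body></html>"
page_c = "<html><body><table><tr><td>1</td></tr></table></body></html>"

same_layout = error_sum_of_squares([profile(page_a, 4), profile(page_b, 4)])
mixed_layout = error_sum_of_squares([profile(page_a, 4), profile(page_c, 4)])
```

In this sketch, pages that share a layout but differ in text yield identical profiles and an error sum of squares of zero, while grouping pages with different layouts raises the score. A lower score therefore suggests a structurally tighter cluster.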

Keywords: universal web wrapper; web wrapper design; DOM; complexity; tree structure; html; formatting; web page clusters; error sum of square; web pages; internet; web page selection; content extraction.

DOI: 10.1504/IJKWI.2009.027926

International Journal of Knowledge and Web Intelligence, 2009 Vol.1 No.1/2, pp.81 - 94

Published online: 19 Aug 2009
