Title: A framework to derive web page context from hyperlink structure

Authors: Naresh Chauhan, A.K. Sharma

Addresses: Department of Computer Engineering, YMCA Institute of Engineering, Faridabad, Haryana – 121006, India; Department of Computer Engineering, YMCA Institute of Engineering, Faridabad, Haryana – 121006, India

Abstract: Because an anchor in an HTML document points to a related document, picture or media application, anchor text is a potential resource for extracting information about the associated web page. However, anchor text is sometimes absent altogether, or the anchor tag contains only a single word or an image. In these situations, the text surrounding the link, or link-context, becomes important because it can be used to derive the context of the target web page. In this paper, a dataset of about 100 web pages from different categories of the Open Directory Project (ODP) is surveyed and analysed. The results show that cohesive text surrounding the anchor, in the form of full sentences, and non-cohesive text present elsewhere in the in-link web pages provide rich semantic information about a target web page, which in turn can be considered the context of that page. Since a target web page generally has several in-links, a filtering mechanism based on linguistic analysis of all context sentences, which selects the context sentence that best describes the target, has been developed and is described and evaluated in this paper.
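
Illustration: the idea of taking the full sentence around an anchor as link-context can be sketched as follows. This is a minimal sketch, not the authors' implementation. The class `LinkContextParser`, the function `link_contexts` and the punctuation-based sentence splitter are assumptions made for illustration only. The linguistic filtering step mentioned in the abstract is not shown.

```python
import re
from html.parser import HTMLParser


class LinkContextParser(HTMLParser):
    """Hypothetical helper: collects page text and the offsets of each anchor."""

    def __init__(self):
        super().__init__()
        self.chunks = []   # text pieces in document order
        self.length = 0    # number of characters collected so far
        self.links = []    # (href, start, end) offsets into the joined text
        self._open = None  # (href, start) of the anchor currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._open = (href, self.length)

    def handle_data(self, data):
        self.chunks.append(data)
        self.length += len(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._open:
            href, start = self._open
            self.links.append((href, start, self.length))
            self._open = None


def link_contexts(html):
    """Return anchor text and the enclosing sentence for every link in html."""
    parser = LinkContextParser()
    parser.feed(html)
    parser.close()
    text = "".join(parser.chunks)
    # Assumption: a sentence is a run of text ending in '.', '!' or '?'.
    sentences = [m.span() for m in re.finditer(r"[^.!?]+[.!?]?", text)]
    results = []
    for href, start, end in parser.links:
        context = ""
        for s, e in sentences:
            if s <= start < e:
                # Collapse whitespace left behind by tags such as <img>.
                context = " ".join(text[s:e].split())
                break
        results.append({
            "href": href,
            "anchor": text[start:end].strip(),
            "context": context,
        })
    return results
```

For an image anchor the `anchor` field is empty, so the enclosing sentence is the only textual description of the target. This is the situation in which the abstract says link-context becomes important.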

Keywords: anchor text; hyperlinks; link-context; in-links; out-links; cohesive-text; non-cohesive-text; semantic information; web page context; hyperlink structure; filtering mechanisms.

DOI: 10.1504/IJICT.2008.024006

International Journal of Information and Communication Technology, 2008 Vol.1 No.3/4, pp.329 - 346

Published online: 23 Mar 2009
