Title: Mining the web with hierarchical crawlers – a resource sharing based crawling approach

Authors: Anirban Kundu, Ruma Dutta, Rana Dattagupta, Debajyoti Mukhopadhyay

Addresses:
Netaji Subhash Engineering College, West Bengal University of Technology, West Bengal-700 152, India; Web Intelligence & Distributed Computing Research Lab (WIDiCoReL), Green Tower C-9/1, Golf Green, Calcutta-700095, India.
Netaji Subhash Engineering College, West Bengal University of Technology, West Bengal-700 152, India; Web Intelligence & Distributed Computing Research Lab (WIDiCoReL), Green Tower C-9/1, Golf Green, Calcutta-700095, India.
Jadavpur University, West Bengal-700 032, India.
Calcutta Business School, Diamond Harbour Road, Bishnupur, West Bengal-743 503, India; Web Intelligence & Distributed Computing Research Lab (WIDiCoReL), Green Tower C-9/1, Golf Green, Calcutta-700095, India.

Abstract: An important component of any web search engine is its crawler, also known as a robot or spider. An efficient set of crawlers makes a search engine more powerful, alongside its other performance factors such as its ranking algorithm, storage mechanism and indexing techniques. In this paper, we propose an extended technique for crawling the World Wide Web (WWW) on behalf of a search engine. The approach combines multiple crawlers working in parallel with the mechanism of focused crawling (Chakrabarti et al., 1999a, 2002; Mukhopadhyay et al., 2006). The structure of a website is divided into several levels, based on its hyperlink structure, for downloading web pages from that website (Chakrabarti et al., 1999b; Mukhopadhyay and Singh, 2004). The number of crawlers at each level is not fixed but dynamic: it is determined on demand at execution time by a threaded program, based on the number of hyperlinks in a specific web page. This paper also proposes a focused hierarchical crawling technique in which crawlers are created dynamically at runtime for different domains and crawl web pages while sharing resources.
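
The level-based idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the in-memory SITE graph, the fetch_links stand-in, the crawl_by_levels function and the max_workers_cap parameter are all hypothetical names chosen for illustration, and a thread pool is assumed as the threading mechanism. The sketch sizes the pool for each level at runtime from the number of hyperlinks discovered at the previous level, and all crawler threads share a single record of seen pages.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory website: page -> hyperlinks found on that page.
SITE = {
    "/": ["/a", "/b"],
    "/a": ["/a1", "/a2", "/b"],
    "/b": ["/b1"],
    "/a1": [],
    "/a2": ["/"],
    "/b1": [],
}


def fetch_links(page):
    """Stand-in for downloading a page and extracting its hyperlinks."""
    return SITE.get(page, [])


def crawl_by_levels(seed, max_workers_cap=8):
    """Crawl level by level; level 0 is the seed page and level k+1
    holds the not-yet-seen pages linked from level k."""
    seen = {seed}          # shared across all levels to avoid re-crawling
    levels = [[seed]]
    frontier = [seed]
    while frontier:
        # The number of crawler threads is chosen per level at runtime,
        # from the number of pages (hyperlinks) waiting at this level.
        workers = min(len(frontier), max_workers_cap)
        with ThreadPoolExecutor(max_workers=workers) as pool:
            # pool.map returns results in the same order as 'frontier'.
            results = list(pool.map(fetch_links, frontier))
        next_frontier = []
        for links in results:
            for link in links:
                if link not in seen:
                    seen.add(link)
                    next_frontier.append(link)
        if next_frontier:
            levels.append(next_frontier)
        frontier = next_frontier
    return levels


if __name__ == "__main__":
    for depth, pages in enumerate(crawl_by_levels("/")):
        print(depth, pages)
```

A real crawler would replace fetch_links with an HTTP download and link extraction step, and a focused variant would additionally filter links by domain or topic before adding them to the next level.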

Keywords: seed queues; single crawlers; parallel crawlers; hierarchical crawlers; focused crawlers; domain specific crawlers; resource sharing; web mining; web search engines; world wide web; hyperlinks; web page crawling.

DOI: 10.1504/IJIIDS.2009.023040

International Journal of Intelligent Information and Database Systems, 2009, Vol. 3, No. 1, pp. 90-106

Published online: 08 Feb 2009
