Title: A constrained crawling approach and its application to a specialised search engine

Authors: Mehdi Adda

Addresses: Department of Computer Science, Engineering and Mathematics, University of Quebec at Rimouski, 300, allee des Ursulines, C.P. 3300, succ. A, Rimouski, Quebec, G5L 3A1, Canada

Abstract: In this paper, we present an approach to crawling and parsing websites based on their logical structure rather than on random exploration. In this approach, we use a set of constraints to identify web pages and their components. To enforce these constraints, we present a set of primitives that rely on predicate verification. Our model is flexible enough to reflect the tree-like logical structures of websites, and thus avoids the need for complex information analysis and content classification techniques. Furthermore, because the model is implemented as a domain-specific language (DSL), describing crawling tasks is straightforward. Using this DSL, we developed and deployed a prototype of a dynamic web application with full-text search capabilities that periodically crawls, parses, and analyses the content of selected online newspapers. A set of experiments and comparisons highlights the effectiveness of the proposed crawling approach.

Keywords: information retrieval; web crawling; search engines; web pages; web page identification; full-text search; online newspapers; parsing; constraints; constrained crawling.

DOI: 10.1504/IJICT.2011.041928

International Journal of Information and Communication Technology, 2011 Vol.3 No.3, pp.258 - 273

Published online: 21 Oct 2014
