Authors: Mehdi Adda
Addresses: Department of Computer Science, Engineering and Mathematics, University of Quebec at Rimouski, 300, allee des Ursulines, C.P. 3300, succ. A, Rimouski, Quebec, G5L 3A1, Canada
Abstract: In this paper, we present an approach to crawling and parsing websites based on their logical structure rather than on random exploration. In this approach, we use a set of constraints to identify web pages and their components. To enforce these constraints, we present a set of primitives that rely on predicate verification. The model is flexible enough to reflect the tree-like logical structure of websites, which avoids the need for complex information analysis and content classification techniques. Furthermore, because the model is implemented as a domain-specific language (DSL), describing crawling tasks is straightforward. Using this DSL, we developed and deployed a prototype dynamic web application with full-text search capabilities that periodically crawls, parses, and analyses the content of selected online newspapers. A set of experiments and comparisons highlights the effectiveness of the proposed crawling approach.
Keywords: information retrieval; web crawling; search engines; web pages; web page identification; full-text search; online newspapers; parsing; constraints; constrained crawling.
International Journal of Information and Communication Technology, 2011 Vol.3 No.3, pp.258 - 273
Received: 08 May 2021
Accepted: 12 May 2021
Published online: 15 Aug 2011