Title: Improving web information indexing and retrieval based on center block duplication detection

Authors: Tyrone Cadenhead, Jinlin Chen, Terry Cook

Addresses: Department of Computer Science, University of Dallas, Texas, USA; Department of Computer Science, Queens College, City University of New York, Flushing, NY 11367, USA; Department of Computer Science, Graduate Centre, City University of New York, New York 10016, USA

Abstract: Duplicated information on today's Web has a serious negative impact on Web search engines: it increases index size and lowers the efficiency of Web information retrieval. An important fact is that, for various reasons, a large amount of Web content duplication occurs at the block level in addition to the site and page levels. Moreover, in most Web searches the desired information is located in the center block of a relevant page. Based on these two observations, we propose an efficient block-level duplication detection algorithm based on resemblance transitivity, and we index center blocks instead of entire Web pages for Web information retrieval. Experiments show that these strategies effectively reduce index size and index construction time without sacrificing the effectiveness of Web information retrieval.
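
Illustration: the abstract does not say how resemblance is measured or how transitivity is exploited, so the following is only a minimal sketch under stated assumptions, not the authors' algorithm. It assumes that resemblance is the Jaccard overlap of word shingles and that transitivity means blocks linked by a chain of above-threshold resemblances are treated as one duplicate group, which lets already-grouped pairs skip comparison. The names `shingles`, `resemblance`, `group_duplicates`, and the threshold of 0.8 are hypothetical.

```python
from itertools import combinations


def shingles(text, k=3):
    """Return the set of k-word shingles of a block's text."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def resemblance(a, b):
    """Jaccard resemblance of two shingle sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def group_duplicates(blocks, threshold=0.8, k=3):
    """Group block ids connected by chains of above-threshold resemblance.

    Assumption: transitivity is modeled with union-find, so if A resembles B
    and B resembles C, all three share a group even when A and C fall below
    the threshold when compared directly.
    """
    parent = {bid: bid for bid in blocks}

    def find(x):
        # Follow parent links to the group root, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    sets = {bid: shingles(text, k) for bid, text in blocks.items()}
    for a, b in combinations(blocks, 2):
        ra, rb = find(a), find(b)
        if ra == rb:
            continue  # already grouped via transitivity, so skip the comparison
        if resemblance(sets[a], sets[b]) >= threshold:
            parent[rb] = ra

    groups = {}
    for bid in blocks:
        groups.setdefault(find(bid), []).append(bid)
    return list(groups.values())
```

Under these assumptions, only one representative block per group would need to be added to the inverted index, which is one way the reduction in index size described above could arise.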

Keywords: duplication detection; inverted index; layout structure detection; information retrieval; web information; information indexing; internet; center block; resemblance transitivity.

DOI: 10.1504/IJICA.2008.019687

International Journal of Innovative Computing and Applications, 2008 Vol.1 No.3, pp.194 - 204

Published online: 20 Jul 2008
