Authors: Tyrone Cadenhead, Jinlin Chen, Terry Cook
Addresses: Department of Computer Science, University of Dallas, Texas, USA; Department of Computer Science, Queens College, City University of New York, Flushing, NY 11367, USA; Department of Computer Science, Graduate Centre, City University of New York, New York 10016, USA
Abstract: Duplicated information on today's Web has a serious negative impact on Web search engines: it increases the size of the index and lowers the efficiency of Web information retrieval. A large amount of Web content duplication occurs at the block level, in addition to the site and page levels. Moreover, in most Web searches the desired information is located in the center block of a relevant page. Based on these two observations, we propose an efficient block-level duplication detection algorithm based on resemblance transitivity, and we index center blocks instead of entire Web pages for Web information retrieval. Experiments show that these strategies can effectively reduce index size and index construction time without sacrificing the effectiveness of Web information retrieval.
Keywords: duplication detection; inverted index; layout structure detection; information retrieval; web information; information indexing; internet; center block; resemblance transitivity.
International Journal of Innovative Computing and Applications, 2008 Vol.1 No.3, pp.194 - 204
Available online: 20 Jul 2008
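
Illustrative sketch: the abstract names block-level duplication detection based on resemblance transitivity but gives no algorithmic details, so the following Python is only one possible reading, not the authors' method. It assumes that resemblance is the Jaccard overlap of word shingles and that transitivity means a new block need only be compared with one representative per duplicate group rather than with every earlier block. The function names, the shingle width `k`, and the `threshold` value are all hypothetical.

```python
def shingles(text, k=3):
    """Return the set of k-word shingles of a block's text (assumed representation)."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def resemblance(a, b):
    """Jaccard resemblance of two shingle sets (assumed similarity measure)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def group_duplicates(blocks, threshold=0.8):
    """Group near-duplicate blocks, comparing each block only with group representatives.

    Treating resemblance as transitive is the assumption that lets us skip
    comparisons with the non-representative members of a group.
    Returns (groups of block indices, number of resemblance computations).
    """
    reps = []  # one (shingle_set, member_indices) entry per duplicate group
    comparisons = 0
    for idx, text in enumerate(blocks):
        s = shingles(text)
        for rep_set, members in reps:
            comparisons += 1
            if resemblance(s, rep_set) >= threshold:
                members.append(idx)
                break
        else:
            # No existing group matched: this block starts a new group.
            reps.append((s, [idx]))
    return [members for _, members in reps], comparisons
```

With three identical blocks and one unrelated block, this sketch performs three resemblance computations instead of the six an all-pairs comparison would need. Whether the savings reported in the paper arise in the same way cannot be determined from the abstract alone.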