Title: An evaluation of provenance-based near-duplicates detection

Authors: Y. Syed Mudhasir; J. Deepika; S. Sendhilkumar

Addresses: Department of Computer Science and Engineering, Anna University, Chennai-25, Tamil Nadu, India; Department of Computer Science and Engineering, Anna University, Chennai-25, Tamil Nadu, India; Department of Information Science and Technology, Anna University, Chennai-25, Tamil Nadu, India

Abstract: Existing search engines suffer from redundancy in their search results. Detecting and eliminating such redundancy (near-duplicates) is an active area of research among search engine researchers. Provenance-based factors can improve web search by providing quality content to the user. Many factors that affect personalisation may also prove useful to users in determining the quality and trustworthiness of web documents. In addition, provenance information based on the 6W factors helps filter near-duplicates from search results. This paper therefore aims to develop a web search system that uses a provenance-based technique for near-duplicate detection and elimination. The system incorporates a personalised crawler (focused crawler) that computes author credentials, which contribute to the trustworthiness of a web document. Finally, the results of the proposed system are compared with those of existing algorithms using a test bed of web documents.

Keywords: near-duplicates; provenance information; semantics; trustworthiness; focused crawlers; information retrieval; DTM; document term matrix; provenance matrix; web search; redundancy; search results; personalised crawlers; web documents.

DOI: 10.1504/IJKWI.2011.044122

International Journal of Knowledge and Web Intelligence, 2011 Vol.2 No.2/3, pp.168 - 184

Available online: 09 Dec 2011