Title: Critical review of various near-duplicate detection methods in web crawl and their prospective application in drug discovery

Authors: Lavanya Pamulaparty; C.V. Guru Rao; M. Sreenivasa Rao

Addresses: Department of Computer Science and Engineering, Jawaharlal Nehru Technological University Hyderabad, Hyderabad, India; Department of Computer Science and Engineering, SR Engineering College, Warangal, India; School of Informatics, Jawaharlal Nehru Technological University Hyderabad, Hyderabad, India

Abstract: For near-duplicate detection, various methods available in the literature are compared in terms of their application, utility, and context. In most cases, their performance is highlighted so that anyone choosing an algorithm can find this review useful. Moreover, certain emerging algorithms such as oblique and streaming random forests are reported, which will help researchers develop new algorithms especially suited to Big Data and cloud environments. The coverage is not exhaustive but nevertheless considers all important algorithms used in practice, so that practitioners can find it handy when making implementation decisions. As a case study, the application of a random forest approach to near-duplicate detection in Chinese herbal drug discovery is proposed.

Keywords: near-duplicate detection; big data; cloud environment; web crawling; random forest.

DOI: 10.1504/IJBET.2017.087723

International Journal of Biomedical Engineering and Technology, 2017 Vol.25 No.2/3/4, pp.212 - 226

Received: 15 Oct 2016
Accepted: 01 Feb 2017

Published online: 31 Oct 2017
