Title: Critical review of various near-duplicate detection methods in web crawl and their prospective application in drug discovery
Authors: Lavanya Pamulaparty; C.V. Guru Rao; M. Sreenivasa Rao
Addresses: Department of Computer Science and Engineering, Jawaharlal Nehru Technological University Hyderabad, Hyderabad, India; Department of Computer Science and Engineering, SR Engineering College, Warangal, India; School of Informatics, Jawaharlal Nehru Technological University Hyderabad, Hyderabad, India
Abstract: Various near-duplicate detection methods available in the literature are compared in terms of their application, utility, and context. In most cases their performance is highlighted, so that anyone choosing an algorithm can find the comparison useful. Moreover, certain emerging algorithms, such as oblique and streaming random forests, are reported; these may help researchers develop new algorithms especially suited to Big Data and cloud environments. The coverage is not exhaustive but nevertheless considers all the important algorithms used in practice, so that practitioners can find it handy when making implementation decisions. As an application case study, the use of a random forest approach to near-duplicate detection in Chinese herbal drug discovery is proposed.
Keywords: near-duplicate detection; big data; cloud environment; web crawling; random forest.
International Journal of Biomedical Engineering and Technology, 2017, Vol. 25, No. 2/3/4, pp. 212-226
Received: 15 Oct 2016
Accepted: 01 Feb 2017
Published online: 23 Oct 2017