Publisher: Inderscience Publishers

Title: Critical review of various near-duplicate detection methods in web crawl and their prospective application in drug discovery

Authors: Lavanya Pamulaparty; C.V. Guru Rao; M. Sreenivasa Rao

Addresses: Department of Computer Science and Engineering, Jawaharlal Nehru Technological University Hyderabad, Hyderabad, India; Department of Computer Science and Engineering, SR Engineering College, Warangal, India; School of Informatics, Jawaharlal Nehru Technological University Hyderabad, Hyderabad, India

Abstract: Near-duplicate detection methods available in the literature are compared in terms of their application, utility, and context. In most cases, their performance is highlighted so that readers choosing an algorithm will find the comparison useful. In addition, emerging algorithms such as oblique and streaming random forests are reported, which may help researchers develop new algorithms suited to Big Data and cloud environments. The coverage is not exhaustive but nevertheless considers all important algorithms used in practice, so that practitioners can use it to make implementation decisions. As an application case study, the use of a random forest approach to near-duplicate detection in Chinese herbal drug discovery is proposed.

Keywords: near-duplicate detection; big data; cloud environment; web crawling; random forest.

DOI: 10.1504/IJBET.2017.087723

International Journal of Biomedical Engineering and Technology, 2017 Vol.25 No.2/3/4, pp.212 - 226

Received: 15 Oct 2016
Accepted: 01 Feb 2017
Published online: 31 Oct 2017
