Title: A genetic programming-based approach and machine learning approaches to the classification of multiclass anti-malarial datasets

Authors: Madhulata Kumari; Neeraj Tiwari; Naidu Subbarao

Addresses: Department of Information Technology, Kumaun University, S.S.J. Campus, Almora, Uttarakhand, 263601, India; Department of Statistics, Kumaun University, S.S.J. Campus, Almora, Uttarakhand, 263601, India; School of Computational and Integrative Sciences, Jawaharlal Nehru University, New Delhi, 110067, India

Abstract: Feature selection approaches have been widely applied to deal with sample size problems in the classification of activity datasets. The present work uses genetic programming (GP) to model the system of descriptors of anti-malarial inhibitors and to understand the impact of those descriptors on inhibitory effects. An experimental dataset of anti-malarial inhibitors was used to derive the optimised system by GP. Additionally, we developed machine learning models using random forest (RF), decision tree, support vector machine (SVM) and Naïve Bayes on an anti-malarial dataset obtained from the ChEMBL database and evaluated their predictive capability. Based on the statistical evaluation, the RF model showed a higher area under the curve (AUC) and better accuracy, sensitivity and specificity in the cross-validation tests than the other models. The statistical results indicated that the RF model was the best predictive model, with 82.51% accuracy and an ROC value of 89.7%. We deployed the RF classifier on three datasets: a phytochemical compound dataset, the NCI natural product dataset IV and an approved drugs dataset, containing 918, 423 and 1554 compounds, of which 153, 81 and 250 compounds, respectively, were predicted to be anti-malarial. Further, to prioritise drug-like compounds, Lipinski's rule was applied to the active phytochemicals, which resulted in 13 hit anti-malarial molecules. Thus, such predictive models are useful for finding novel hit anti-malarial compounds and could also be used to discover novel drugs for other diseases.
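
The drug-likeness step mentioned in the abstract can be illustrated with a minimal sketch of a Lipinski rule-of-five filter. This is not the authors' code. It assumes the four descriptors have already been computed for each compound, and the `Descriptors` container, the `drug_like` helper, the `max_violations` parameter and the example values are all hypothetical. The abstract does not say how many violations were tolerated, so the strict default of zero is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Descriptors:
    """Precomputed descriptors for one compound (hypothetical container)."""
    name: str
    mol_weight: float       # molecular weight in daltons
    logp: float             # octanol-water partition coefficient
    h_bond_donors: int
    h_bond_acceptors: int


def lipinski_violations(d: Descriptors) -> int:
    """Count how many of the four rule-of-five thresholds are exceeded."""
    return sum([
        d.mol_weight > 500,
        d.logp > 5,
        d.h_bond_donors > 5,
        d.h_bond_acceptors > 10,
    ])


def drug_like(compounds, max_violations=0):
    """Keep compounds with at most `max_violations` violations.

    Assumption: the strict default (0) is illustrative; a common
    alternative is to tolerate one violation.
    """
    return [c for c in compounds if lipinski_violations(c) <= max_violations]


# Made-up descriptor values for illustration only; not from the paper's datasets.
examples = [
    Descriptors("compound_a", 320.4, 2.1, 2, 5),
    Descriptors("compound_b", 612.7, 6.3, 4, 9),
    Descriptors("compound_c", 500.0, 5.0, 5, 10),
]

hits = drug_like(examples)
print([c.name for c in hits])
```

Here `compound_c` sits exactly on every threshold and passes, because the thresholds are only violated when a value exceeds them. In a screening workflow of the kind described, a filter like this would be applied only to compounds the classifier had already predicted as active.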

Keywords: machine learning approaches; data mining; random forest; SVM; support vector machine; Naïve Bayes; decision tree; malaria; phytochemical; natural product.

DOI: 10.1504/IJCBDD.2018.096125

International Journal of Computational Biology and Drug Design, 2018 Vol.11 No.4, pp.275 - 294

Received: 20 Jun 2017
Accepted: 31 Oct 2017

Published online: 13 Nov 2018
