Title: Evaluation of predictive models based on random forest, decision tree and support vector machine classifiers and virtual screening of anti-mycobacterial compounds

Authors: Madhulata Kumari; Neeraj Tiwari; Naidu Subbarao; Subhash Chandra

Addresses: Department of Information Technology, Kumaun University, SSJ Campus, Almora, Uttarakhand 263601, India; Department of Statistics, Kumaun University, SSJ Campus, Almora, Uttarakhand 263601, India; School of Computational and Integrative Sciences, Jawaharlal Nehru University, New Delhi 110067, India; Department of Botany, Kumaun University, SSJ Campus, Almora, Uttarakhand 263601, India

Abstract: Three machine learning classifiers (random forest, decision tree and support vector machine) were used to build predictive models from an anti-mycobacterial ChEMBL database and were evaluated for their predictive capability. Before model development, the data were pre-processed to address class imbalance by applying a cost-sensitive classifier and by filtering data instances with the supervised synthetic minority oversampling technique (SMOTE), spread subsample and resample methods. Statistical evaluation indicated that the random forest model performed best, with the highest accuracy (93.83%), specificity (90.5%), receiver operating characteristic (ROC) area (0.984), Matthews correlation coefficient (MCC, 0.772) and kappa statistic (0.768) among the models, whereas LibSVM showed the highest sensitivity (94.4%). Additionally, toxicity prediction models based on the SingleCellcall DSSTox carcinogenicity database (AID1189) were developed, and random forest was again the best model. Deploying both RF models on two unknown datasets identified 1317 of 1554 approved drugs and 2234 of 18,746 compounds in the ChEMBL anti-malarial dataset as non-toxic, anti-mycobacterial compounds. Thus, machine learning models offer an efficient way to identify novel anti-mycobacterial hit compounds. We suggest that such machine learning techniques could be very useful for screening drug candidates not only for tuberculosis but also for other diseases.

Keywords: machine learning; random forest; DT; SVM; support vector machine; J48; Mycobacterium tuberculosis; drug discovery.

DOI: 10.1504/IJCBDD.2017.085410

International Journal of Computational Biology and Drug Design, 2017 Vol.10 No.3, pp.248 - 263

Received: 04 Jan 2017
Accepted: 21 Mar 2017

Published online: 25 Jul 2017
