Title: Distributed heterogeneous ensemble learning on Apache Spark for ligand-based virtual screening

Authors: Karima Sid; Mohamed Batouche

Addresses: Department of Computer Science, Constantine 2 University-Abdelhamid Mehri, Constantine, Algeria; Department of Information Technology, CCIS – RC, Princess Nourah University, Riyadh, Saudi Arabia

Abstract: Virtual screening is one of the most common computer-aided drug design techniques; it applies computational tools and methods to large libraries of molecules to identify potential drugs. Ensemble learning is a recent paradigm intended to improve machine learning results in terms of predictive performance and robustness. It has been successfully applied in ligand-based virtual screening (LBVS) approaches. However, applying ensemble learning to huge molecular libraries is computationally expensive. Hence, distributing and parallelising the task with frameworks such as Apache Spark has become an important step. In this paper, we propose a new approach, HEnsL_DLBVS, for heterogeneous ensemble learning distributed on Spark, to improve large-scale LBVS results. To handle the problem of large imbalanced training datasets, we propose a novel hybrid technique. We generate new training datasets to evaluate the approach. Experimental results confirm the effectiveness of our approach, with satisfactory accuracy, and its superiority over homogeneous models.

Keywords: virtual screening; big data; computer-aided drug design; CADD; Apache Spark; machine learning; drug discovery; ensemble learning; imbalanced datasets; Spark MLlib; ligand-based virtual screening; LBVS.

DOI: 10.1504/IJDMMM.2021.112920

International Journal of Data Mining, Modelling and Management, 2021 Vol.13 No.1/2, pp.160 - 191

Received: 31 Jul 2018
Accepted: 26 Aug 2019

Published online: 09 Feb 2021
