
Title: Distributed heterogeneous ensemble learning on Apache Spark for ligand-based virtual screening

Authors: Karima Sid; Mohamed Batouche

Addresses: Department of Computer Science, Constantine 2 University-Abdelhamid Mehri, Constantine, Algeria; Department of Information Technology, CCIS – RC, Princess Nourah University, Riyadh, Saudi Arabia

Abstract: Virtual screening is one of the most common computer-aided drug design techniques; it applies computational tools and methods to large libraries of molecules to identify candidate drugs. Ensemble learning is a recent paradigm introduced to improve the predictive performance and robustness of machine learning models. It has been successfully applied in ligand-based virtual screening (LBVS) approaches. However, applying ensemble learning to huge molecular libraries is computationally expensive. Hence, distributing and parallelising the task with frameworks such as Apache Spark has become an important step. In this paper, we propose a new approach, HEnsL_DLBVS, for heterogeneous ensemble learning distributed on Spark to improve large-scale LBVS results. To handle the problem of large, imbalanced training datasets, we propose a novel hybrid technique. We generate new training datasets to evaluate the approach. Experimental results confirm the effectiveness of our approach, which achieves satisfactory accuracy and outperforms homogeneous models.

Keywords: virtual screening; big data; computer-aided drug design; CADD; Apache Spark; machine learning; drug discovery; ensemble learning; imbalanced datasets; Spark MLlib; ligand-based virtual screening; LBVS.

DOI: 10.1504/IJDMMM.2021.112920

International Journal of Data Mining, Modelling and Management, 2021 Vol.13 No.1/2, pp.160 - 191

Received: 31 Jul 2018
Accepted: 26 Aug 2019
Published online: 09 Feb 2021
