Authors: Sadaf Kabir; Leily Farrokhvar
Addresses: Department of Industrial and Management Systems Engineering, West Virginia University, 1306 Evansdale Drive, Morgantown, WV 26506, USA; Department of Systems and Operations Management, California State University Northridge, 18111 Nordhoff St., Northridge, CA 91330, USA
Abstract: Developing accurate predictive models can substantially help healthcare providers improve the quality of their services. However, medical data often contain many variables, and not all of them contribute equally to the prediction. Irrelevant and redundant features in a dataset can unnecessarily increase computational cost and complexity while degrading the performance of the predictive model. In this study, we employ gradient-based prediction attribution as a general tool to identify important features in differentiable predictive models, such as neural networks (NN) and linear regression. Building on this approach, we analyse single-stage and multi-stage scenarios for feature selection using ten medical datasets. Through extensive experiments, we demonstrate that combining the gradient-based approach with NNs provides a powerful nonlinear technique for identifying the features that contribute most to the prediction. In particular, nonlinear gradient-based feature selection achieves competitive results or significant improvements over previously reported results on all datasets.
Keywords: machine learning; feature selection; neural networks; logistic regression; disease prediction models; healthcare data.
International Journal of Data Mining, Modelling and Management, 2022 Vol.14 No.3, pp.248 - 268
Received: 13 Sep 2020
Accepted: 22 Jan 2021
Published online: 05 Sep 2022
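
The gradient-based attribution idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not give their attribution formula, network architecture, or datasets. The sketch assumes a logistic regression model (the simplest differentiable classifier), a synthetic four-feature dataset in which only the first two features drive the label, and an importance score defined as the mean absolute gradient of the predicted probability with respect to each input feature. All function names (`make_data`, `train_logistic`, `gradient_attribution`, `select_top_k`) are hypothetical.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def make_data(n=200, seed=0):
    # Synthetic data (an assumption for illustration): features 0 and 1
    # determine the label, features 2 and 3 are pure noise.
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(n):
        x = [rng.gauss(0.0, 1.0) for _ in range(4)]
        score = 2.0 * x[0] - 1.5 * x[1] + 0.1 * rng.gauss(0.0, 1.0)
        X.append(x)
        y.append(1.0 if score > 0 else 0.0)
    return X, y


def train_logistic(X, y, lr=0.1, epochs=200):
    # Full-batch gradient descent on the mean cross-entropy loss.
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for x, t in zip(X, y):
            p = sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)
            err = p - t
            for j in range(d):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def gradient_attribution(X, w, b):
    # For p = sigmoid(w.x + b), dp/dx_j = p * (1 - p) * w_j.
    # The score of feature j is the mean of |dp/dx_j| over all samples.
    scores = [0.0] * len(w)
    for x in X:
        p = sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)
        for j in range(len(w)):
            scores[j] += abs(p * (1.0 - p) * w[j])
    return [s / len(X) for s in scores]


def select_top_k(scores, k):
    # Indices of the k features with the largest attribution scores.
    order = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return order[:k]


X, y = make_data()
w, b = train_logistic(X, y)
scores = gradient_attribution(X, w, b)
selected = select_top_k(scores, k=2)
print("attribution scores:", [round(s, 4) for s in scores])
print("selected features:", selected)
```

For a linear model such as this one, the ranking reduces to ordering features by the magnitude of their weights. The approach becomes nonlinear only when the model is a neural network, in which case the input gradients would typically come from automatic differentiation rather than a closed-form expression.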