Title: Synergistic effects between data corpora properties and machine learning performance in data pipelines

Authors: Roberto Bertolini; Stephen J. Finch

Addresses: Department of Applied Mathematics and Statistics, Stony Brook University, Math Tower, Room P-139A, Stony Brook, NY 11794-3600, USA; Department of Applied Mathematics and Statistics, Stony Brook University, Math Tower, Room P-139A, Stony Brook, NY 11794-3600, USA

Abstract: Analysing data requires a computationally feasible modelling pipeline. Properties of the data corpus affect how variably machine learning (ML) techniques perform within such pipelines; however, this has not been thoroughly investigated using simulation methodologies. A Monte Carlo study is used to compare differences in the area under the curve (AUC) metric for large-n, small-p corpora, examining: 1) the choice of ML algorithm; 2) the size of the training database; 3) measurement error; 4) class imbalance magnitude; and 5) missing data pattern. Our simulations are consistent with established results on the conditions under which these algorithms and corpora properties perform best, while providing insights into their synergistic effects. Measurement error negatively impacted pipeline performance across all corpora factors and ML algorithms. A larger training corpus ameliorated the decrease in predictive efficacy resulting from measurement error, class imbalance, and missing data. We discuss the implications of these findings for designing pipelines to enhance prediction performance.
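The kind of Monte Carlo comparison described in the abstract can be illustrated with a minimal sketch. It is not the authors' simulation design: the data-generating model (Gaussian features with a mean shift of 0.5 for the positive class), the nearest-centroid scoring rule, and every name and default value (`simulate`, `centroid_direction`, `mean_auc`, `noise_sd`, `prevalence`) are assumptions chosen for illustration. It varies only three of the five factors (training size, measurement error, class imbalance) and leaves out missing data patterns and the choice of ML algorithm.

```python
import random
import statistics

def auc(scores, labels):
    """AUC via the pairwise (Mann-Whitney) identity; ties count as 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def simulate(n, p, prevalence, noise_sd, rng):
    """Hypothetical generator: n rows, p features, fixed class counts.

    Positive rows have a mean shift of 0.5 in every feature (an assumption).
    Measurement error is additive Gaussian noise with standard deviation noise_sd.
    """
    n_pos = min(n - 1, max(1, round(n * prevalence)))  # keep both classes present
    labels = [1] * n_pos + [0] * (n - n_pos)
    rows = [[rng.gauss(0.5 * y, 1.0) + rng.gauss(0.0, noise_sd) for _ in range(p)]
            for y in labels]
    return rows, labels

def centroid_direction(rows, labels):
    """Difference of class means; a stand-in for a trained classifier."""
    p = len(rows[0])
    def mean(cls, j):
        return statistics.fmean(r[j] for r, y in zip(rows, labels) if y == cls)
    return [mean(1, j) - mean(0, j) for j in range(p)]

def mean_auc(n_train, prevalence, noise_sd, reps=30, n_test=400, p=5, seed=0):
    """Average test AUC over `reps` Monte Carlo replicates of one design cell."""
    rng = random.Random(seed)
    aucs = []
    for _ in range(reps):
        x_train, y_train = simulate(n_train, p, prevalence, noise_sd, rng)
        x_test, y_test = simulate(n_test, p, prevalence, noise_sd, rng)
        w = centroid_direction(x_train, y_train)
        scores = [sum(wj * xj for wj, xj in zip(w, x)) for x in x_test]
        aucs.append(auc(scores, y_test))
    return statistics.fmean(aucs)

if __name__ == "__main__":
    # Cross two illustrative factors: training size and measurement error.
    for n_train in (20, 200):
        for noise_sd in (0.0, 1.0):
            print(n_train, noise_sd, round(mean_auc(n_train, 0.2, noise_sd), 3))
```

Each call to `mean_auc` estimates the average AUC for one combination of factor levels, so crossing the levels in nested loops gives a small factorial design in which interaction effects can be inspected. Any numbers it prints reflect this toy model only and should not be read as reproducing the paper's results.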

Keywords: data pipeline; interaction/synergistic effects; Monte Carlo simulation; machine learning; binary classification; area under the curve; AUC.

DOI: 10.1504/IJDMMM.2022.125261

International Journal of Data Mining, Modelling and Management, 2022 Vol.14 No.3, pp.217 - 233

Received: 06 Jan 2021
Accepted: 10 Feb 2021

Published online: 05 Sep 2022
