Title: Evaluating the performance of regression algorithms on datasets with missing data

Authors: Luciano Costa Blomberg; Daiane Hemerich; Duncan Dubugras Alcoba Ruiz

Addresses: Graduate Program in Computer Science, Pontifical Catholic University of Rio Grande do Sul (PUCRS), Av. Ipiranga 6681, Porto Alegre – RS, Brazil (all three authors)

Abstract: Real-world applications frequently involve missing data, which makes data analysis a non-trivial task. This paper analyses six representative regression algorithms, evaluating their predictive performance and sensitivity to missing data. For this purpose, we used 20 public datasets and manipulated them to hold controlled levels of missing data. Our empirical analysis shows that RepTree is the least influenced by missing data, followed by LinearRegression. IBK is the most influenced, presenting the highest error. However, M5P remains the algorithm with the best predictive performance, although it is only the fourth least influenced by missing data.
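The step of manipulating a dataset to hold a controlled level of missing data can be illustrated with a minimal Python sketch. It is not the authors' procedure: the function names `inject_missing` and `missing_rate`, the toy data, and the missing-data levels are hypothetical, and the sketch assumes values are removed completely at random, which the abstract does not specify.

```python
import random


def inject_missing(rows, fraction, seed=0):
    """Return a copy of `rows` with a given fraction of cells set to None.

    Assumption: cells are removed completely at random (MCAR). The abstract
    does not state which missingness mechanism was actually used.
    """
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    rng = random.Random(seed)
    copy = [list(row) for row in rows]  # leave the original data untouched
    cells = [(i, j) for i, row in enumerate(copy) for j in range(len(row))]
    n_missing = round(fraction * len(cells))
    for i, j in rng.sample(cells, n_missing):
        copy[i][j] = None
    return copy


def missing_rate(rows):
    """Fraction of cells that are None."""
    total = sum(len(row) for row in rows)
    missing = sum(cell is None for row in rows for cell in row)
    return missing / total if total else 0.0


# Hypothetical toy attribute matrix: 10 instances, 4 attributes.
data = [[float(i + j) for j in range(4)] for i in range(10)]

for level in (0.1, 0.2, 0.4):  # illustrative levels, not the paper's
    degraded = inject_missing(data, level, seed=42)
    print(level, missing_rate(degraded))
```

Each degraded copy could then be passed to a regression algorithm and its error compared against the error on the complete data, which is one way to quantify sensitivity to missing data.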

Keywords: business intelligence; missing data; machine learning; data mining; regression algorithms; predictive performance.

DOI: 10.1504/IJBIDM.2013.057744

International Journal of Business Intelligence and Data Mining, 2013 Vol.8 No.2, pp.105 - 131

Received: 05 Jul 2013
Accepted: 11 Jul 2013

Published online: 28 Jun 2014
