Title: Using imputation algorithms when missing values appear in the test data in contrast with the training data

Authors: Narges Sadat Bathaeian

Addresses: Computer Engineering Department, Bu-Ali Sina University, Hamedan, Iran

Abstract: Real datasets often suffer from missing data, and imputation is a common solution. Most research applies imputation algorithms to the training data, so the output variable of the samples might influence the imputation model. This paper compares imputation algorithms when they are applied to test data as well as training data. First, the relations between the output variable and different imputation algorithms are investigated. Then six types of imputation algorithms are applied to both training data and test data. The chosen datasets are publicly available and cover both classification and regression tasks, and missing values are injected into them artificially. The results show that the performance of all algorithms declines when the output variable is eliminated. In particular, the decline for the algorithm that uses k nearest neighbours for imputation is not negligible on the classification datasets. Nevertheless, algorithms based on random forests decline less and give better results than the other five types of algorithms.
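The contrast the abstract draws, between imputing with and without access to the output variable, can be illustrated with a small sketch. This is not one of the six algorithm types evaluated in the paper: it swaps in plain mean imputation (class-conditional when labels are available, overall otherwise) purely for illustration. The synthetic data, the 20% missing rate, the train/test split, and all function names (`inject_missing`, `fit_imputer`, `impute`, `rmse`) are assumptions made for this example.

```python
import random
from statistics import mean


def inject_missing(rows, rate, rng):
    """Copy rows, replacing each value by None with probability `rate`."""
    return [[None if rng.random() < rate else v for v in row] for row in rows]


def column_means(rows):
    """Per-feature mean over the observed (non-None) values."""
    means = []
    for j in range(len(rows[0])):
        observed = [r[j] for r in rows if r[j] is not None]
        means.append(mean(observed) if observed else 0.0)
    return means


def fit_imputer(train_X, train_y):
    """Store overall feature means and one set of means per class label."""
    per_class = {}
    for label in set(train_y):
        subset = [x for x, y in zip(train_X, train_y) if y == label]
        per_class[label] = column_means(subset)
    return {"overall": column_means(train_X), "per_class": per_class}


def impute(rows, model, labels=None):
    """Fill None entries; without labels only the overall means are usable."""
    filled = []
    for i, row in enumerate(rows):
        means = model["overall"]
        if labels is not None:
            means = model["per_class"].get(labels[i], model["overall"])
        filled.append([means[j] if v is None else v for j, v in enumerate(row)])
    return filled


def rmse(filled, truth, masked):
    """Root-mean-square error over the cells that were made missing."""
    errs = [(f - t) ** 2
            for frow, trow, mrow in zip(filled, truth, masked)
            for f, t, m in zip(frow, trow, mrow) if m is None]
    return (sum(errs) / len(errs)) ** 0.5


# Synthetic two-class data whose features depend strongly on the label.
rng = random.Random(0)
y = [i % 2 for i in range(200)]
X_true = [[10.0 * c + rng.gauss(0, 1), 5.0 * c + rng.gauss(0, 1)] for c in y]
X_missing = inject_missing(X_true, 0.2, rng)

# Hypothetical split: first 150 rows for training, last 50 for testing.
train_X, train_y = X_missing[:150], y[:150]
test_X, test_y, test_truth = X_missing[150:], y[150:], X_true[150:]

model = fit_imputer(train_X, train_y)
test_with_y = impute(test_X, model, labels=test_y)  # output variable known
test_without_y = impute(test_X, model)              # output variable hidden

rmse_with_y = rmse(test_with_y, test_truth, test_X)
rmse_without_y = rmse(test_without_y, test_truth, test_X)
print(f"RMSE with output variable:    {rmse_with_y:.2f}")
print(f"RMSE without output variable: {rmse_without_y:.2f}")
```

In this toy setting the imputation error is larger when the label is hidden, because the features were constructed to depend on the class. It only mirrors the direction of the effect described in the abstract and says nothing about the magnitudes reported in the paper.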

Keywords: missing values; imputation algorithms; regression; kNN; MICE; random forest; tree; EM.

DOI: 10.1504/IJDATS.2018.092447

International Journal of Data Analysis Techniques and Strategies, 2018 Vol.10 No.2, pp.111 - 123

Accepted: 10 Nov 2016
Published online: 21 Jun 2018
