Title: Data imputation algorithms for mixed variable types in large scale educational assessment: a comparison of random forest, multivariate imputation using chained equations, and MICE with recursive partitioning

Authors: W. Holmes Finch; Maria E. Hernandez Finch; Melissa Singh

Addresses: Department of Educational Psychology, Ball State University, Muncie, IN 47306, USA; Department of Educational Psychology, Ball State University, Muncie, IN 47306, USA; Department of Educational Psychology, Ball State University, Muncie, IN 47306, USA

Abstract: Missing data is a major issue with which researchers working on large scale assessments must contend. Such research efforts frequently collect a wide array of variable types, including dichotomous, ordinal, nominal, normal, skewed, and count variables. This variation in data distributions renders many recommended methods for missing data imputation less than optimal because they assume a single joint probability model for all variables. This simulation study compared four imputation methods: random forest imputation (RF), multivariate imputation by chained equations (MICE), and two combinations of these approaches, in which MICE is paired with either recursive partitioning trees (MICE-RPT) or random forests (MICE-RF). Results reveal that data imputed with RF, MICE, MICE-RF, and MICE-RPT yield more accurate parameter estimates than data treated with listwise deletion (LD), and that MICE-RF and MICE-RPT are associated with more accurate estimates than MICE or RF alone. Implications of these results and recommendations for practice are discussed.

Keywords: missing data; random forest; multivariate imputation; chained equations; MICE; data imputation; recursive partitioning; educational assessment datasets; simulation.

DOI: 10.1504/IJQRE.2016.077803

International Journal of Quantitative Research in Education, 2016 Vol.3 No.3, pp.129 - 153

Received: 23 Dec 2014
Accepted: 02 Oct 2015

Published online: 15 Jul 2016
