Title: Imputation of missing values for semi-supervised data using the proximity in random forests

Authors: Tsunenori Ishioka

Addresses: Research Division, The National Center for University Entrance Examinations, 2-19-23 Komaba, Meguro-ku, Tokyo 153-8501, Japan

Abstract: This paper presents a procedure that imputes missing values by using random forests on semi-supervised data. Applying our method to Hewlett-Packard Labs' spam data and Edgar Anderson's iris data, we found that its rate of correct classification is higher than that of two other methods: a simple extension of Liaw's 'rfImpute' for (un)supervised data, and the k-nearest neighbour method (kNN). Our method allows missing predictor variables as well as a missing response variable. An imputation that uses random forests for semi-supervised cases in the training dataset had not been implemented until now.
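As a rough illustration of what proximity-based imputation involves, the sketch below fills a missing numeric value with a proximity-weighted mean and a missing label with a proximity-weighted vote. It is not the paper's procedure, which the abstract does not spell out. The function names, the hand-written proximity matrix, and the toy iris-like values are all assumptions made for this example; in practice the proximities would come from a fitted random forest, as the share of trees in which two cases land in the same terminal node.

```python
from collections import defaultdict


def impute_numeric(values, proximity):
    """Fill None entries with the proximity-weighted mean of the observed entries."""
    filled = list(values)
    for i, v in enumerate(values):
        if v is not None:
            continue
        num = den = 0.0
        for j, w in enumerate(values):
            if w is not None and j != i:
                num += proximity[i][j] * w
                den += proximity[i][j]
        # Leave the gap in place if case i shares no proximity with any observed case.
        filled[i] = num / den if den > 0 else None
    return filled


def impute_categorical(values, proximity):
    """Fill None entries with the category that collects the largest summed proximity."""
    filled = list(values)
    for i, v in enumerate(values):
        if v is not None:
            continue
        votes = defaultdict(float)
        for j, w in enumerate(values):
            if w is not None and j != i:
                votes[w] += proximity[i][j]
        filled[i] = max(votes, key=votes.get) if votes else None
    return filled


# Hypothetical proximity matrix for four cases (assumed, not taken from the paper).
proximity = [
    [1.0, 0.8, 0.1, 0.1],
    [0.8, 1.0, 0.2, 0.0],
    [0.1, 0.2, 1.0, 0.7],
    [0.1, 0.0, 0.7, 1.0],
]

# Toy data: one missing predictor value and one missing response (label).
petal_length = [1.4, None, 4.7, 4.5]
species = ["setosa", "setosa", "versicolor", None]

filled_length = impute_numeric(petal_length, proximity)
filled_species = impute_categorical(species, proximity)
```

Here the second case is pulled mostly toward the first case, its closest neighbour by proximity, and the unlabelled fourth case takes the label of the third. A full procedure would typically alternate between refitting the forest on the filled data and recomputing the proximities for several rounds.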

Keywords: ensemble learning; data imputation; missing data; k-nearest neighbour; kNN; R; rfImpute; random forests; semi-supervised learning; predictor variables; response variables; missing values.

DOI: 10.1504/IJBIDM.2013.057737

International Journal of Business Intelligence and Data Mining, 2013 Vol.8 No.2, pp.155 - 166

Published online: 28 Jun 2014
