Title: A new scalable approach for missing value imputation in high-throughput microarray data on Apache Spark

Authors: Madhuri Gupta; Bharat Gupta

Addresses: Department of CS&IT, Jaypee Institute of Information Technology, Noida, UP, India ' Department of CS&IT, Jaypee Institute of Information Technology, Noida, UP, India

Abstract: High-dimensional data are acquired using High-Throughput Technology (HTT), and data extracted using HTT contain a large number of missing values. Gene expression data are vital in healthcare research, so their missing values need to be reconstructed, which is a challenging task. This work proposes a scalable technique, PC-ImNN, which combines Pearson correlation with Monte Carlo and a modified Nearest Neighbour method to predict missing values. Monte Carlo is a technique that uses repeated random sampling to make numerical estimates of unknown parameters. Pearson correlation is combined with Monte Carlo to maintain the distribution of the estimated datapoints, and the Nearest Neighbour technique is applied to find the nearest estimated datapoints. The proposed model is compared with five existing imputation techniques. The results show that the proposed technique performs better in terms of mean square error and imputation accuracy. Apache Spark is used to speed up processing.
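The abstract names three ingredients (Pearson correlation, Monte Carlo sampling, and nearest neighbours) without giving the algorithm itself. The following is a minimal sketch of how such ingredients could fit together, not the authors' PC-ImNN method and not a Spark implementation. The function names `pearson` and `impute`, the parameters `k`, `n_draws`, and `seed`, and the choice to weight neighbours by their correlation are all assumptions made for illustration. Missing values are represented as `None`, and each row stands for one gene's expression profile.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation over the positions where both profiles are observed."""
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    n = len(pairs)
    if n < 2:
        return 0.0
    mean_x = sum(a for a, _ in pairs) / n
    mean_y = sum(b for _, b in pairs) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in pairs)
    var_x = sum((a - mean_x) ** 2 for a, _ in pairs)
    var_y = sum((b - mean_y) ** 2 for _, b in pairs)
    if var_x == 0 or var_y == 0:
        return 0.0  # correlation is undefined for a constant profile
    return cov / math.sqrt(var_x * var_y)


def impute(rows, k=2, n_draws=1000, seed=0):
    """Fill each missing entry from the k most positively correlated rows.

    Illustrative assumption: the estimate is the mean of n_draws values
    sampled at random from the neighbours, weighted by their correlation.
    """
    rng = random.Random(seed)
    filled = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Score every other row that has an observed value in column j.
            scored = [
                (pearson(row, other), other[j])
                for m, other in enumerate(rows)
                if m != i and other[j] is not None
            ]
            # Nearest neighbours: highest positive correlation first.
            neighbours = sorted(
                (s for s in scored if s[0] > 0),
                key=lambda s: s[0],
                reverse=True,
            )[:k]
            if not neighbours:
                continue  # no usable neighbour: leave the entry missing
            weights = [r for r, _ in neighbours]
            values = [v for _, v in neighbours]
            # Monte Carlo step: repeated random sampling, then averaging.
            draws = rng.choices(values, weights=weights, k=n_draws)
            filled[i][j] = sum(draws) / n_draws
    return filled


genes = [
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.0, 6.0, 8.0],
    [1.0, 2.0, None, 4.0],
]
completed = impute(genes)
print(completed[2])
```

With enough draws the sampled mean approaches the correlation-weighted mean of the neighbours' values, so the sampling here mainly illustrates the Monte Carlo idea; the published method may use the sampling quite differently.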

Keywords: missing value; Pearson's correlation; nearest neighbour; mean square error; Monte Carlo method; support vector machine; microarray data.

DOI: 10.1504/IJDMB.2020.10027156

International Journal of Data Mining and Bioinformatics, 2020 Vol.23 No.1, pp.79 - 100

Received: 02 Oct 2019
Accepted: 02 Jan 2020

Published online: 28 Feb 2020
