Authors: Borislava Petrova Vrigazova; Ivan Ganchev Ivanov
Addresses: Sofia University, 1113 Sofia, 125 Tsarigradsko Shose Blvd., Bl. 3, Bulgaria; Sofia University, 1113 Sofia, 125 Tsarigradsko Shose Blvd., Bl. 3, Bulgaria
Abstract: In classification problems, cross-validation draws random samples from the dataset in order to improve the model's ability to assign new observations to the correct class. Research articles from various fields show that, when applied to regression problems, the bootstrap can improve either the predictive ability of the model or its ability to select features. The purpose of our research is to show that the bootstrap, used as a model selection procedure in classification problems, can outperform cross-validation. We compare the performance measures of cross-validation and the bootstrap on a set of classification problems and analyse their practical advantages and disadvantages. We show that the bootstrap procedure can reduce execution time compared to the cross-validation procedure while preserving the accuracy of the classification model. This advantage of the bootstrap is particularly important for large datasets, as the time needed to fit the model can be reduced without decreasing the model's performance.
Keywords: logistic regression; decision tree; k-nearest neighbour; KNN; the bootstrap; cross-validation.
International Journal of Data Mining, Modelling and Management, 2020 Vol.12 No.4, pp.428 - 446
Accepted: 15 Feb 2020
Published online: 05 Nov 2020
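
Illustrative sketch (not the authors' code): the abstract compares cross-validation and the bootstrap as resampling procedures for classification. The minimal Python example below shows one common way the two schemes generate train/test splits, and scores a simple 1-nearest-neighbour classifier under each. Everything in it is an assumption made for illustration: the function names (kfold_splits, bootstrap_splits, predict_1nn, mean_accuracy) are hypothetical, the data are synthetic, and evaluating each bootstrap resample on its out-of-bag observations is one standard variant that may differ from the procedure used in the paper. It does not attempt to reproduce the paper's timing or accuracy results.

```python
import random
from statistics import mean


def kfold_splits(n, k, seed=0):
    """Yield (train, test) index lists for k-fold cross-validation."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k roughly equal, disjoint folds
    for i in range(k):
        test = folds[i]
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, test


def bootstrap_splits(n, b, seed=0):
    """Yield (train, test) index lists for b bootstrap resamples.

    Training indices are drawn with replacement; the test set is the
    out-of-bag observations (an assumed evaluation scheme).
    """
    rng = random.Random(seed)
    for _ in range(b):
        train = [rng.randrange(n) for _ in range(n)]
        drawn = set(train)
        test = [j for j in range(n) if j not in drawn]
        if test:  # skip the very unlikely resample with no out-of-bag rows
            yield train, test


def predict_1nn(X_train, y_train, x):
    """Return the label of the nearest training point (squared Euclidean)."""
    dists = [sum((a - c) ** 2 for a, c in zip(row, x)) for row in X_train]
    return y_train[dists.index(min(dists))]


def mean_accuracy(X, y, splits):
    """Average test accuracy of the 1-NN classifier over the given splits."""
    scores = []
    for train, test in splits:
        X_tr = [X[i] for i in train]
        y_tr = [y[i] for i in train]
        hits = sum(predict_1nn(X_tr, y_tr, X[i]) == y[i] for i in test)
        scores.append(hits / len(test))
    return mean(scores)


# Synthetic two-class toy data (not from the paper).
rng = random.Random(42)
X = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(60)] + \
    [[rng.gauss(4, 1), rng.gauss(4, 1)] for _ in range(60)]
y = [0] * 60 + [1] * 60

cv_acc = mean_accuracy(X, y, kfold_splits(len(X), k=5))
boot_acc = mean_accuracy(X, y, bootstrap_splits(len(X), b=5))
print(f"5-fold CV accuracy:     {cv_acc:.3f}")
print(f"bootstrap OOB accuracy: {boot_acc:.3f}")
```

In this sketch both schemes fit the classifier the same number of times (five), so any runtime difference would come from the size and composition of the training sets rather than from the number of fits; wrapping each call in a timer such as time.perf_counter() would be one way to compare them on a real dataset.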