Title: Ensemble feature selection approach for imbalanced textual data using MapReduce

Authors: Houda Amazal; Mohammed Ramdani; Mohamed Kissi

Addresses: Computer Science Laboratory, Faculty of Sciences and Technologies, University Hassan II Casablanca, BP 146, 20650 Mohammedia, Morocco ' Computer Science Laboratory, Faculty of Sciences and Technologies, University Hassan II Casablanca, BP 146, 20650 Mohammedia, Morocco ' Computer Science Laboratory, Faculty of Sciences and Technologies, University Hassan II Casablanca, BP 146, 20650 Mohammedia, Morocco

Abstract: Feature selection is a fundamental pre-processing phase in text classification. It speeds up machine learning algorithms and improves classification accuracy. In a big data context, feature selection techniques have to deal with two major issues: the huge dimensionality and the imbalanced nature of the data. However, the libraries of big data frameworks, such as Hadoop, implement only a few single feature selection methods, whose robustness does not meet the requirements imposed by the large amount of data. To deal with this, we propose in this paper a distributed ensemble feature selection (DEFS) approach for large imbalanced datasets using MapReduce. A set of experiments is conducted on four datasets to confirm the improvement brought about by the proposed approach. The reported results show that in most cases our method yields better classification performance than other widely used feature selection techniques.

Keywords: ensemble feature selection; EFS; imbalance data; MapReduce; text classification.

DOI: 10.1504/IJBIDM.2021.118925

International Journal of Business Intelligence and Data Mining, 2021 Vol.19 No.4, pp.395 - 417

Received: 18 Nov 2019
Accepted: 28 Feb 2020

Published online: 10 Nov 2021
