Title: New under-sampling methods to address the problem of unbalanced sentiment classification: application on Arabic datasets

Authors: Asmaa Mountassir; Houda Benbrahim; Ilham Berrada

Addresses: ENSIAS, Mohamed 5 Rabat University, Morocco; ENSIAS, Mohamed 5 Rabat University, Morocco; ENSIAS, Mohamed 5 Rabat University, Morocco

Abstract: This paper presents a study addressing the problem of unbalanced datasets in supervised sentiment classification in an Arabic context. We propose three different methods to under-sample the majority class documents. Our goal is to compare the effectiveness of the proposed methods with that of common random under-sampling. We also aim to evaluate the behaviour of the classifier under different under-sampling rates. We use three common classifiers, namely Naïve Bayes, support vector machines and k-nearest neighbours. The experiments are carried out on two Arabic datasets that we built internally. We show that the results obtained on the first dataset, which is slightly skewed, are better than those obtained on the second one, which is highly skewed. We also conclude that Naïve Bayes is sensitive to dataset size: the more we reduce the data, the more the results degrade. Support vector machines, however, are highly sensitive to unbalanced datasets. We record unstable behaviour for k-nearest neighbours. The results also show that we can rely on the proposed techniques and that they are typically competitive with random under-sampling.
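
For readers unfamiliar with the baseline mentioned in the abstract, the following is a minimal sketch of random under-sampling of the majority class. It is not the authors' implementation, and it does not attempt to reproduce the three proposed methods, which the abstract does not describe. The function name `random_undersample`, its parameters, and the interpretation of the "under-sampling rate" as the fraction of majority-class documents removed are all assumptions made for illustration.

```python
import random


def random_undersample(docs, labels, majority_label, rate, seed=0):
    """Randomly drop a fraction `rate` of the majority-class documents.

    Assumption: `rate` is the share of majority-class documents removed
    (0.0 keeps everything, 1.0 removes the whole majority class). The
    paper may define the under-sampling rate differently.
    """
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be between 0 and 1")
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    majority_idx = [i for i, y in enumerate(labels) if y == majority_label]
    n_drop = int(rate * len(majority_idx))  # truncates toward zero
    dropped = set(rng.sample(majority_idx, n_drop))
    kept = [i for i in range(len(labels)) if i not in dropped]
    # Minority-class documents are always kept; original order is preserved.
    return [docs[i] for i in kept], [labels[i] for i in kept]


# Toy data (hypothetical): 8 positive and 2 negative documents.
toy_docs = [f"doc{i}" for i in range(10)]
toy_labels = ["pos"] * 8 + ["neg"] * 2
sampled_docs, sampled_labels = random_undersample(
    toy_docs, toy_labels, majority_label="pos", rate=0.5
)
```

Under these assumptions, calling the function with increasing values of `rate` gives progressively smaller and more balanced training sets, which is the kind of sweep over under-sampling rates that the abstract refers to.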

Keywords: sentiment classification; unbalanced classification; Arabic language; under-sampling; naive Bayes; support vector machines; SVM; k-nearest neighbour; kNN.

DOI: 10.1504/IJICT.2016.077687

International Journal of Information and Communication Technology, 2016 Vol.9 No.1, pp.64-77

Received: 12 Nov 2013
Accepted: 07 Aug 2014

Published online: 13 Jul 2016
