Title: Bengali paper classification using ensemble machine learning algorithms

Authors: Niaz Ashraf Khan; Emrul Hasan Zawad; Rashedur M. Rahman

Addresses: Department of Electrical and Computer Engineering, North South University, Bashundhara, Dhaka, 1229, Bangladesh; Department of Electrical and Computer Engineering, North South University, Bashundhara, Dhaka, 1229, Bangladesh; Department of Electrical and Computer Engineering, North South University, Bashundhara, Dhaka, 1229, Bangladesh

Abstract: Text classification is one of the most challenging problems in natural language processing (NLP). Language models are at the heart of NLP, and the ability to represent text as numbers has enabled many NLP tasks, for example text categorisation, translation, and summarisation. Unfortunately, NLP for Bengali text has not yet reached the state-of-the-art level achieved for languages such as English, mostly because of the scarcity of resources and the complexity of Bengali grammar. As a result, little work has been done in this field. In this paper, we study a word embedding method, Word2vec based on continuous bag of words (CBOW), in combination with several ensemble machine learning algorithms: adaptive boosting (AdaBoost), light gradient boosting machine (LightGBM), XGBoost, and random forest classifiers (RFC). The model is trained on a large corpus of Bengali newspaper text containing 99,283,949 words and 8,284,804 sentences in 392,772 documents. In our experiments, the Word2vec CBOW model combined with the XGBoost algorithm performed much better than the other models and achieved 92.24% accuracy.
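
As a rough illustration of the ideas named in the abstract, the sketch below shows how CBOW frames its training data (each target word is predicted from its surrounding context words) and one common way to turn word vectors into a single document feature vector for a classifier. The abstract does not say how documents are vectorised, so the averaging step is an assumption, and the function names, the window size, and the toy tokens and embeddings are all hypothetical.

```python
from typing import Dict, List, Tuple


def cbow_pairs(tokens: List[str], window: int = 2) -> List[Tuple[List[str], str]]:
    """Build (context words, target word) pairs, the training view used by CBOW."""
    pairs = []
    for i, target in enumerate(tokens):
        # Context = up to `window` words on each side of the target word.
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        if context:
            pairs.append((context, target))
    return pairs


def document_vector(tokens: List[str],
                    embeddings: Dict[str, List[float]],
                    dim: int) -> List[float]:
    """Average the vectors of known words (assumed pooling, not from the paper)."""
    known = [embeddings[t] for t in tokens if t in embeddings]
    if not known:
        # No known words: fall back to a zero vector of the right size.
        return [0.0] * dim
    return [sum(vec[k] for vec in known) / len(known) for k in range(dim)]


if __name__ == "__main__":
    # Hypothetical toy sentence and 2-dimensional embeddings, for illustration only.
    sentence = ["dhaka", "cricket", "match", "today"]
    toy_embeddings = {"dhaka": [1.0, 3.0], "cricket": [3.0, 5.0]}
    print(cbow_pairs(sentence, window=1))
    print(document_vector(sentence, toy_embeddings, dim=2))
```

In a full pipeline, vectors like the one returned by `document_vector` would be the input features for an ensemble classifier such as XGBoost or a random forest, with the news category as the label.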

Keywords: NLP; natural language processing; categorisation; document classification; decision tree classifier.

DOI: 10.1504/IJKESDP.2022.127625

International Journal of Knowledge Engineering and Soft Data Paradigms, 2022 Vol.7 No.2, pp.77 - 94

Received: 19 Jan 2021
Accepted: 04 Aug 2021

Published online: 13 Dec 2022
