Title: Binary encoding-based morpheme boundary detection of Dogri language
Authors: Parul Gupta; Shubhnandan S. Jamwal
Addresses: Department of Computer Science and IT, University of Jammu, J&K, India; Department of Computer Science and IT, University of Jammu, J&K, India
Abstract: Machine learning (ML) models such as decision tree, SVM, random forest and KNN are generally applied to structured data for morpheme boundary detection, whereas a binary representation presents the data in a format that the models can use directly, without pre-processing. Dogri is an Indo-Aryan language spoken primarily in the state of Jammu and Kashmir, as well as in certain regions of the neighbouring states of Himachal Pradesh and Punjab. This paper explores and analyses common ML models that are rarely applied to detecting morpheme boundaries. A dataset of 10,000 Dogri words, with their morpheme boundaries encoded as bit values, is used for training and evaluation. We trained the bi-LSTM and the ML models on different datasets and observed that the bi-LSTM outperformed the other ML models, achieving recall of 69.50%, 79.64%, and 81.33% on three different datasets, respectively.
Keywords: morpheme; Dogri; boundary detection; morphology; language modelling; word segmentation; low-resource languages.
DOI: 10.1504/IJIEI.2025.144269
International Journal of Intelligent Engineering Informatics, 2025 Vol.13 No.1, pp.114 - 133
Received: 24 Jan 2024
Accepted: 04 Jun 2024
Published online: 04 Feb 2025
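
The abstract describes words stored "along with their morpheme boundaries in bit values" and reports recall as the headline metric. The abstract does not specify the exact encoding, so the following is only a minimal sketch of one plausible scheme: one bit per character, set to 1 on the last character of every non-final morpheme. The function names, the `+` separator, and the English example word are hypothetical and are not taken from the paper.

```python
def encode_boundaries(segmented, sep="+"):
    """Return (word, bits) for a segmented word such as 'un+happi+ness'.

    Assumed scheme (not confirmed by the paper): bits[i] == 1 when a
    morpheme ends at character i and another morpheme follows.
    """
    morphemes = segmented.split(sep)
    word = "".join(morphemes)
    bits = [0] * len(word)
    pos = 0
    for morpheme in morphemes[:-1]:  # the final morpheme ends the word, no boundary bit
        pos += len(morpheme)
        bits[pos - 1] = 1
    return word, bits


def boundary_recall(gold, pred):
    """Recall over boundary bits: true positives / (true positives + false negatives)."""
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    return tp / (tp + fn) if (tp + fn) else 0.0


word, gold = encode_boundaries("un+happi+ness")
pred = [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]  # a hypothetical prediction that misses one boundary
print(word, gold, boundary_recall(gold, pred))
```

Note that Python indexes strings by code point, so for Devanagari text a vowel sign or other combining mark would receive its own bit under this sketch; whether the paper encodes at the code-point or grapheme level is not stated in the abstract.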