Title: Binary encoding-based morpheme boundary detection of Dogri language

Authors: Parul Gupta; Shubhnandan S. Jamwal

Addresses: Department of Computer Science and IT, University of Jammu, J&K, India; Department of Computer Science and IT, University of Jammu, J&K, India

Abstract: Machine learning (ML) models such as decision trees, SVM, random forest and KNN are generally applied to structured data for morpheme boundary detection, but a binary representation presents the data in a format that the models can use directly, without pre-processing. Dogri is an Indo-Aryan language spoken primarily in the state of Jammu and Kashmir, as well as in certain regions of the neighbouring states of Himachal Pradesh and Punjab. This research paper explores and analyses common ML models, which are rarely applied to detecting morpheme boundaries. A dataset of 10,000 Dogri words, along with their morpheme boundaries encoded as bit values, is used for training and evaluation. In this paper, we trained the bi-LSTM and ML models on three different datasets and observed that the bi-LSTM outperformed the other ML models, achieving recall of 69.50%, 79.64%, and 81.33% on the three datasets respectively.
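The abstract does not spell out the exact bit layout, so the following Python sketch shows only one plausible reading of a binary boundary encoding: one bit per character, set to 1 when a morpheme boundary follows that character. The function names `encode_boundaries` and `boundary_recall` are hypothetical, and the example word is an English placeholder rather than an item from the Dogri dataset. For Devanagari text, a real pipeline would probably need to work on grapheme clusters rather than raw code points, which this sketch does not attempt.

```python
def encode_boundaries(morphemes):
    """Join morphemes into a word and return (word, bits).

    Assumed scheme (not confirmed by the paper): bits[i] == 1 when a
    morpheme boundary falls immediately after character i, else 0.
    """
    morphemes = [m for m in morphemes if m]  # ignore empty segments
    word = "".join(morphemes)
    bits = [0] * len(word)
    pos = 0
    # Every morpheme except the last one ends at an internal boundary.
    for m in morphemes[:-1]:
        pos += len(m)
        bits[pos - 1] = 1
    return word, bits


def boundary_recall(gold, pred):
    """Recall over boundary positions: TP / (TP + FN)."""
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    return tp / (tp + fn) if (tp + fn) else 0.0


# Placeholder example: "un" + "break" + "able"
word, gold = encode_boundaries(["un", "break", "able"])
# A hypothetical prediction that finds only the first boundary.
pred = [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
print(word, gold, boundary_recall(gold, pred))
```

Under this assumed scheme, each word becomes a fixed-alphabet bit vector that a classifier or a sequence model can consume as per-character labels, and recall measures the share of true boundaries that the model recovers.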

Keywords: morpheme; Dogri; boundary detection; morphology; language modelling; word segmentation; low-resource languages.

DOI: 10.1504/IJIEI.2025.144269

International Journal of Intelligent Engineering Informatics, 2025 Vol.13 No.1, pp.114 - 133

Received: 24 Jan 2024
Accepted: 04 Jun 2024

Published online: 04 Feb 2025