Title: Elementary discourse unit segmentation for Vietnamese texts

Authors: Chinh Trong Nguyen; Dang Tuan Nguyen

Addresses: University of Information Technology, VNU-HCM, Ho Chi Minh City, Vietnam; Saigon University, Ho Chi Minh City, Vietnam

Abstract: Elementary discourse unit (EDU) segmentation is an important problem in the discourse analysis of text. In Vietnam, no tool or model has yet been officially published to solve this problem. We therefore propose a solution. Our approach applies a sequential labelling method to identify the beginning of each EDU in a sentence. For sequential labelling, we use a deep neural network architecture consisting of a BERT model, which generates word feature vectors as a transfer learning approach, and a feed-forward neural network, which identifies the tag of every word. To build the model, we automatically constructed an EDU segmentation dataset from the Vietnamese constituent treebank NIIVTB and used this dataset to fine-tune the pretrained PhoBERT model. The results show that our EDU segmentation model achieves a span-based F1 score of 0.8, which is sufficient for use in practical tasks.
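
As an illustration of how per-word boundary tags relate to a span-based F1 score, the following is a minimal sketch and not the authors' implementation. The abstract does not state the tag set or the exact scoring rule, so the "B"/"I" tags, the exact-match criterion, and the function names `tags_to_spans` and `span_f1` are all assumptions made for this example.

```python
from typing import List, Set, Tuple

Span = Tuple[int, int]  # (start, end) word indices, end exclusive


def tags_to_spans(tags: List[str]) -> Set[Span]:
    """Turn per-word tags into EDU spans; a "B" tag opens a new EDU.

    Assumed tag scheme: "B" marks the first word of an EDU, anything
    else (here "I") marks a word inside the current EDU.
    """
    if not tags:
        return set()
    starts = [i for i, tag in enumerate(tags) if tag == "B"]
    # Assumption: the first word always opens an EDU, even if not tagged "B".
    if not starts or starts[0] != 0:
        starts.insert(0, 0)
    # Each EDU ends where the next one begins; the last ends with the sentence.
    ends = starts[1:] + [len(tags)]
    return set(zip(starts, ends))


def span_f1(gold: List[str], pred: List[str]) -> float:
    """Exact-match span F1: a predicted EDU counts only if both ends match."""
    gold_spans, pred_spans = tags_to_spans(gold), tags_to_spans(pred)
    correct = len(gold_spans & pred_spans)
    if correct == 0:
        return 0.0
    precision = correct / len(pred_spans)
    recall = correct / len(gold_spans)
    return 2 * precision * recall / (precision + recall)


# A six-word sentence with two gold EDUs; the prediction adds one
# spurious boundary at word 2, so only the second EDU matches exactly.
gold_tags = ["B", "I", "I", "B", "I", "I"]
pred_tags = ["B", "I", "B", "B", "I", "I"]
print(span_f1(gold_tags, pred_tags))
```

Under this scoring, a single misplaced boundary invalidates the whole span that contains it, so a span-based score is stricter than per-word tag accuracy.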

Keywords: EDU segmentation; sequential labelling; BERT; transfer learning.

DOI: 10.1504/IJIIDS.2022.124090

International Journal of Intelligent Information and Database Systems, 2022 Vol. 15 No. 3, pp. 249-266

Received: 10 Feb 2021
Accepted: 17 May 2021

Published online: 12 Jul 2022
