Title: Uyghur short-text classification based on reliable sub-word morphology

Authors: Sardar Parhat; Mijit Ablimit; Askar Hamdulla

Addresses: Information Science and Engineering Institute, Xinjiang University, Urumqi, 830046, China; Information Science and Engineering Institute, Xinjiang University, Urumqi, 830046, China; Information Science and Engineering Institute, Xinjiang University, Urumqi, 830046, China

Abstract: In this paper, we investigate short-text classification methods for a low-resource language, combined with reliable stemming and term extraction methods. Uyghur is a morphologically rich agglutinative language in which words are formed by attaching several suffixes to a stem, a property that in theory yields an unbounded vocabulary. Because stems carry the semantic content, stem-based text classification is a promising approach for low-resource, morphologically derivative languages. It is also an efficient way to identify and predict out-of-vocabulary (OOV) words and misspellings from context information. Morphological analysis based on word (or stem) vectors, with stem vectors incorporated into text classification, is a novel approach for the Uyghur language. Our stemming method robustly extracts stems from noisy text, reduces the particle lexicon to 1/3 of the word lexicon, and improves coverage, which makes it well suited to small corpora with high OOV rates. The highest accuracy, 93.5%, is obtained on nine categories of short texts using stem vectors with the chi-square (χ²) feature.
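
As an illustration of the chi-square (χ²) feature mentioned in the abstract, the following is a minimal sketch of the standard χ² term-class statistic computed over stemmed documents. It is not the authors' implementation: the function name `chi2_scores`, the placeholder stems (`s1`, `s2`, ...), and the class labels are hypothetical, and stemming is assumed to have been done beforehand.

```python
from collections import Counter


def chi2_scores(docs, labels, target):
    """Score each stem by its chi-square association with class `target`.

    docs   : list of documents, each a list of stems (assumed pre-stemmed)
    labels : list of class labels, one per document
    target : the class to score stems against
    """
    n = len(docs)
    n_pos = sum(1 for y in labels if y == target)
    n_neg = n - n_pos

    # Document frequency of each stem inside and outside the target class.
    df_pos, df_neg = Counter(), Counter()
    for stems, y in zip(docs, labels):
        counter = df_pos if y == target else df_neg
        counter.update(set(stems))

    scores = {}
    for stem in set(df_pos) | set(df_neg):
        a = df_pos[stem]   # in class, contains stem
        b = df_neg[stem]   # not in class, contains stem
        c = n_pos - a      # in class, lacks stem
        d = n_neg - b      # not in class, lacks stem
        denom = (a + c) * (b + d) * (a + b) * (c + d)
        scores[stem] = n * (a * d - b * c) ** 2 / denom if denom else 0.0
    return scores


# Toy data with placeholder stems (hypothetical, not real Uyghur stems).
toy_docs = [["s1", "s2"], ["s1", "s3"], ["s4", "s5"], ["s4", "s6"]]
toy_labels = ["sport", "sport", "economy", "economy"]
toy_scores = chi2_scores(toy_docs, toy_labels, "sport")
top_stems = sorted(toy_scores, key=toy_scores.get, reverse=True)[:2]
```

In a setup like this, the highest-scoring stems per class would be kept as features, so a smaller stem lexicon directly reduces the number of candidates that need scoring.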

Keywords: word embedding; text classification; morphology; Uyghur.

DOI: 10.1504/IJRIS.2019.102606

International Journal of Reasoning-based Intelligent Systems, 2019 Vol.11 No.3, pp.250 - 255

Received: 22 Aug 2018
Accepted: 07 Jan 2019

Published online: 30 Sep 2019
