Title: Building an annotated corpus for the Albanian language using bilingual projections and regular expressions

Authors: Arbana Kadriu

Addresses: SEE University, Ilindenska bb, Tetovo, Macedonia

Abstract: We present research on creating an annotated corpus for the Albanian language. The corpus is built by combining unsupervised part-of-speech tagging based on bilingual projections with regular expressions. Albanian-English text from a free parallel corpus for the Balkan languages serves as the basis. The annotation follows the universal part-of-speech tag system. The projected tagging yields a tagged Albanian corpus of 60,000 sentences. We investigate the main pitfalls in the output of the parallel projection and use this analysis to define replacement rules for part of the tagged corpus, which change 18% of the initial text. We evaluate the tagged corpus with four different part-of-speech taggers, the best of which reaches 94% accuracy. We discuss further improvements to this corpus, which to our knowledge is the largest annotated corpus in Albanian.
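
The two steps named in the abstract, tag projection across a parallel sentence pair and regular-expression replacement rules, can be illustrated with a minimal sketch. This is not the paper's implementation: the function names, the word alignment, the example sentence, and the two rules are all hypothetical and chosen only to show the general idea. The only thing taken from the source is the use of universal part-of-speech tags.

```python
import re

def project_tags(src_tags, alignment, tgt_len, default="X"):
    """Copy each English tag onto the Albanian token it is aligned with.

    alignment is a list of (english_index, albanian_index) pairs.
    Unaligned Albanian tokens keep the placeholder tag `default`.
    """
    tgt_tags = [default] * tgt_len
    for src_i, tgt_i in alignment:
        tgt_tags[tgt_i] = src_tags[src_i]
    return tgt_tags

# Hypothetical replacement rules (assumed for illustration, not from the paper):
# a token that fully matches the pattern receives the given universal tag.
RULES = [
    (re.compile(r"^\d+([.,]\d+)*$"), "NUM"),   # plain numbers such as 2 or 60,000
    (re.compile(r"^[.,;:!?]$"), "PUNCT"),      # single punctuation marks
]

def apply_rules(tokens, tags):
    """Overwrite a projected tag when a token matches one of the rules."""
    fixed = []
    for token, tag in zip(tokens, tags):
        for pattern, new_tag in RULES:
            if pattern.match(token):
                tag = new_tag
                break
        fixed.append(tag)
    return fixed

# Toy sentence pair: "I have 2 books ." / "Unë kam 2 libra ."
en_tags = ["PRON", "VERB", "NUM", "NOUN", "PUNCT"]
sq_tokens = ["Unë", "kam", "2", "libra", "."]
alignment = [(0, 0), (1, 1), (3, 3)]  # assume the aligner missed "2" and "."

projected = project_tags(en_tags, alignment, len(sq_tokens))
corrected = apply_rules(sq_tokens, projected)
print(list(zip(sq_tokens, corrected)))
```

In this toy case the projection leaves two tokens with the placeholder tag, and the rules then fill them in; real replacement rules would be derived from an error analysis of the projected output, as the abstract describes.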

Keywords: POS tagging; Albanian language; bilingual projection; corpus creation.

DOI: 10.1504/IJKEDM.2019.100760

International Journal of Knowledge Engineering and Data Mining, 2019 Vol.6 No.2, pp.105 - 121

Received: 19 Sep 2018
Accepted: 30 Jan 2019

Published online: 17 Jul 2019
