Building an annotated corpus for the Albanian language using bilingual projections and regular expressions
by Arbana Kadriu
International Journal of Knowledge Engineering and Data Mining (IJKEDM), Vol. 6, No. 2, 2019

Abstract: We present research on creating an annotated corpus for the Albanian language. The corpus is built by combining unsupervised part-of-speech tagging through bilingual projection with regular expressions. Albanian-English text from a free parallel corpus for the Balkan languages serves as the basis. Annotation follows the universal part-of-speech tag set. The projected tagging yields a tagged Albanian corpus of 60,000 sentences. We analyse the main pitfalls in the output of the parallel projection and use this analysis to define replacement rules for part of the tagged corpus, which change 18% of the initial text. We evaluate the tagged corpus with four different part-of-speech taggers, the best of which reaches 94% accuracy. We discuss further improvements to this corpus, which to our knowledge is the largest annotated corpus in Albanian.
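
The two steps named in the abstract, projecting tags across a word alignment and then correcting them with regular expressions, can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the word alignment, the example sentence pair, the fallback tag "X" for unaligned tokens, and the two replacement rules are all assumptions made for illustration.

```python
import re

def project_tags(src_tags, alignment, tgt_len, default="X"):
    """Copy each source-side tag to the target token it is aligned with.

    alignment is a list of (source_index, target_index) pairs; target
    tokens without an alignment keep the placeholder tag `default`.
    """
    tgt_tags = [default] * tgt_len
    for src_i, tgt_i in alignment:
        tgt_tags[tgt_i] = src_tags[src_i]
    return tgt_tags

# Hypothetical replacement rules: (pattern over the whole token, tag to assign).
RULES = [
    (re.compile(r"\d+(?:[.,]\d+)*"), "NUM"),
    (re.compile(r"[.,;:!?]"), "PUNCT"),
]

def apply_rules(tokens, tags):
    """Overwrite a projected tag when the token fully matches a rule."""
    fixed = []
    for token, tag in zip(tokens, tags):
        for pattern, new_tag in RULES:
            if pattern.fullmatch(token):
                tag = new_tag
                break
        fixed.append(tag)
    return fixed

# Illustrative sentence pair; the alignment deliberately omits "2" and ".".
en_tags = ["PRON", "VERB", "NUM", "NOUN", "PUNCT"]   # I have 2 books .
sq_tokens = ["Unë", "kam", "2", "libra", "."]
alignment = [(0, 0), (1, 1), (3, 3)]

projected = project_tags(en_tags, alignment, len(sq_tokens))
corrected = apply_rules(sq_tokens, projected)
print(list(zip(sq_tokens, corrected)))
```

In this toy case the projection leaves two tokens untagged, and the rules fill them in from the surface form alone, which is the kind of gap a regular-expression pass can close without any Albanian training data.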

Online publication date: Wed, 17-Jul-2019
