Title: New rules-based algorithm to improve Arabic stemming accuracy

Authors: Walid Cherif; Abdellah Madani; Mohamed Kissi

Addresses: Laboratory LIMA, Department of Computer Science, Faculty of Sciences, University Chouaib Doukkali, BP 20, 24000 El Jadida, Morocco; Laboratory LAROSERI, Department of Computer Science, Faculty of Sciences, University Chouaib Doukkali, BP 20, 24000 El Jadida, Morocco; Laboratory LIMA, Department of Computer Science, Faculty of Sciences, University Chouaib Doukkali, BP 20, 24000 El Jadida, Morocco

Abstract: In recent years, research in natural language processing has grown steadily, owing to the spread of the internet. However, efforts devoted to the Arabic language remain limited. Because of its morphological and semantic properties, Arabic is considered a difficult language for automatic processing. Consequently, many different approaches have been proposed to deal with morphological variation and the agglutination phenomenon when stemming Arabic texts. Stemming and light stemming are used to remove irrelevant morphological variations from a given word and extract its original stem or root. This research introduces a completely new rules-based algorithm. It removes affixes precisely, using context-sensitive morphological rules, and then deduces the root according to a predefined set of rules. The results show that the accuracy of the proposed algorithm is higher than that of two well-known Arabic stemmers.
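
To make the idea of light stemming concrete, the following is a minimal sketch of generic affix stripping. It is not the algorithm proposed in the article: the affix lists, the `MIN_STEM_LEN` guard, and the function name `light_stem` are assumptions chosen for illustration, and the context-sensitive rules and root deduction described in the abstract are not reproduced here.

```python
# Illustrative affix lists (assumed for this sketch; NOT the article's rule set).
# Longer prefixes come first so that "وال" is tried before "ال" or "و".
PREFIXES = ("وال", "بال", "كال", "فال", "ال", "لل", "و")
SUFFIXES = ("ها", "ان", "ات", "ون", "ين", "ية", "ه", "ة", "ي")

# Assumed guard: never strip an affix if fewer than 3 letters would remain.
MIN_STEM_LEN = 3


def light_stem(word: str) -> str:
    """Strip at most one listed prefix and one listed suffix from `word`."""
    stem = word
    for prefix in PREFIXES:
        if stem.startswith(prefix) and len(stem) - len(prefix) >= MIN_STEM_LEN:
            stem = stem[len(prefix):]
            break
    for suffix in SUFFIXES:
        if stem.endswith(suffix) and len(stem) - len(suffix) >= MIN_STEM_LEN:
            stem = stem[:-len(suffix)]
            break
    return stem


if __name__ == "__main__":
    for w in ("والكتاب", "المعلمون", "كتب"):
        print(w, "->", light_stem(w))
```

A stripper this simple ignores context, so it can remove letters that belong to the word itself; that limitation is the kind of error context-sensitive rules are meant to reduce.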

Keywords: Arabic language; automatic language processing; light stemming; natural language processing; NLP; morphological variation; agglutination.

DOI: 10.1504/IJKEDM.2015.074082

International Journal of Knowledge Engineering and Data Mining, 2015 Vol.3 No.3/4, pp.315 - 336

Available online: 05 Jan 2016
