Title: Inclusion of Wikipedia, a language specific knowledge resource to generate and update a synset in WordNet
Authors: Sunny Rai; Amita Jain; Priyank Pandey
Addresses: Department of Computer Engineering, Netaji Subhas Institute of Technology, Delhi, 110078, India; School of Engineering Sciences, Mahindra Ecole Centrale, Hyderabad, 500043, India; Department of Computer Science and Engineering, Ambedkar Institute of Advanced Communication Technologies and Research, Delhi, 110031, India; Department of Computer Science and Engineering, Ambedkar Institute of Advanced Communication Technologies and Research, Delhi, 110031, India
Abstract: The lack of adequate lexical resources hampers the development of natural language processing tools for less widely spoken languages. Recently, projects such as IndoWordNet have significantly reduced the scarcity of lexicons for Indian languages. However, their coverage remains a concern, and the cost and time involved are further limiting factors. The reluctance to automate lexicon generation is largely attributed to the poor precision of automatically generated synsets. In this paper, we address these issues by incorporating language-specific knowledge resources, which ensure the authenticity of the generated synsets and enable the inclusion of endemic words. We propose a corpus-based approach for automated synset generation that visibly improves the quality of the generated synsets. Experiments performed on a manually created dataset of Hindi words yield a precision of 81.56% and an F-measure of more than 72%.
Keywords: WordNet; lexical database; Indian languages; NLP; natural language processing; SVM; support vector machine; Wikipedia; machine readable lexicon; machine learning; Wiktionary.
International Journal of Technology, Policy and Management, 2019, Vol. 19, No. 4, pp. 405-419
Received: 29 Nov 2017
Accepted: 22 Apr 2018
Published online: 07 Dec 2019