Title: Statistics and linguistic rules in multiword extraction: a comparative analysis

Authors: Shaishav Agrawal; Ratna Sanyal; Sudip Sanyal

Addresses: Indian Institute of Information Technology, Deoghat, Jhalwa, Allahabad-211012, U.P., India (all three authors)

Abstract: A hybrid methodology is proposed for extracting multiword expressions using both linguistic and statistical information. In the proposed methodology, N-grams are first extracted using linguistic patterns, and various statistical measures are then applied to classify these N-grams as multiword expressions. To solve the problem of choosing the cut-off boundary threshold in the statistical filtering phase, a novel method for calculating the boundary threshold is designed. A comparative analysis of a baseline method and the proposed methodology is presented. In the baseline method, the order is reversed: N-grams are first filtered by statistical measures, and linguistic filtering is applied afterwards. Precision, recall and F-score are calculated on a manually annotated corpus. The observed results show that the proposed methodology provides good results for certain types of multiword expressions, such as compound nouns, verb-particles and verb-verb combinations.
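
The pipeline order described in the abstract (linguistic pattern filter first, statistical filter second, then evaluation against a gold annotation) can be illustrated with a minimal Python sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the POS patterns, the use of pointwise mutual information as the statistical measure, and the mean-score cut-off, which is only a placeholder and not the boundary-threshold method the authors propose.

```python
import math
from collections import Counter

# Hypothetical POS patterns for candidate bigrams (assumed, not from the paper):
# compound nouns, verb-particles, verb-verb.
PATTERNS = {("NOUN", "NOUN"), ("VERB", "PART"), ("VERB", "VERB")}


def candidates(tagged):
    """Linguistic step: yield adjacent word pairs whose POS tags match a pattern."""
    for (w1, t1), (w2, t2) in zip(tagged, tagged[1:]):
        if (t1, t2) in PATTERNS:
            yield (w1, w2)


def pmi_scores(tagged):
    """Statistical step: score each candidate bigram with PMI (an assumed measure)."""
    words = [w for w, _ in tagged]
    unigrams = Counter(words)
    bigrams = Counter(zip(words, words[1:]))
    n = len(words)
    scores = {}
    for w1, w2 in set(candidates(tagged)):
        p_xy = bigrams[(w1, w2)] / (n - 1)
        p_x = unigrams[w1] / n
        p_y = unigrams[w2] / n
        scores[(w1, w2)] = math.log2(p_xy / (p_x * p_y))
    return scores


def extract(tagged):
    """Keep candidates scoring at or above the mean score (placeholder cut-off)."""
    scores = pmi_scores(tagged)
    if not scores:
        return set()
    threshold = sum(scores.values()) / len(scores)
    return {pair for pair, s in scores.items() if s >= threshold}


def evaluate(predicted, gold):
    """Return (precision, recall, F-score) of predicted pairs against gold pairs."""
    tp = len(predicted & gold)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_score
```

The input is assumed to be a list of (word, POS tag) pairs from any tagger. Reversing the two steps, scoring all bigrams first and applying the pattern filter afterwards, would correspond to the ordering the abstract attributes to the baseline method.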

Keywords: multiword expressions; collocation extraction; information retrieval; natural language processing; NLP; statistical methods; computational linguistics; linguistic rules; multiword extraction; boundary threshold; compound nouns; verb-particles; verb-verb.

DOI: 10.1504/IJRIS.2014.063954

International Journal of Reasoning-based Intelligent Systems, 2014 Vol.6 No.1/2, pp.59 - 70

Published online: 22 Nov 2014
