Authors: Namita Gupta; P.C. Saxena; J.P. Gupta
Addresses: Maharaja Agrasen Institute of Technology, PSP Area, Plot No. 1, Sector-22, Rohini, Delhi – 110085, India; System Sciences, Jawaharlal Nehru University, New Mehrauli Road, New Delhi – 110067, India; Jaypee Institute of Information Technology, A-10, Sector-62, Noida, Uttar Pradesh – 201307, India
Abstract: The WWW is a repository of a large collection of information available in the form of unstructured documents, so identifying documents of interest in such a huge pool is very challenging. Text summarisation is used in information retrieval to reduce the time needed to search documents. Documents are often ranked on the basis of the summary or abstract provided by their authors, which is not always possible because not all documents come with an abstract or summary. Moreover, when different summarisation tools are used to summarise a document, not all the topics covered in the document are reflected in its summary. In this paper, we propose a method to automate text document summarisation based on term frequency within the document at two levels: paragraph and sentence. To summarise the document, the similarity between the paragraphs and the sentences within each paragraph is considered, using the vector space model. Evaluation of the proposed system on the standard DUC-2002 reference corpus using the ROUGE package indicates average recall, average precision and average F-measure comparable to those of existing summarisation tools (Copernic, SweSum, Extractor, MSWord AutoSummariser, Intelligent, Brevity and Pertinence), taking the DUC-2002 100-word human summaries as the baseline.
Keywords: extract summary; information retrieval; recall-oriented understudy for gisting evaluation; ROUGE tool; text summarisation; vector space model; VSM; document summarisation; sentence ranking; unstructured documents; term frequency; paragraphs; sentences.
International Journal of Data Mining, Modelling and Management, 2013 Vol.5 No.4, pp.380 - 406
Received: 08 May 2021
Accepted: 12 May 2021
Published online: 18 Nov 2013
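
The abstract describes summarisation driven by term frequency at paragraph and sentence level, with similarity measured in a vector space model. The following is a minimal sketch of that general idea, not the authors' algorithm: it assumes raw term-frequency vectors, cosine similarity, blank-line paragraph breaks, and a rule of keeping the sentences most similar to their own paragraph. The function names (`tf_vector`, `cosine`, `summarise`) and the `sentences_per_paragraph` parameter are hypothetical and chosen only for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return its alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def tf_vector(text):
    """Raw term-frequency vector (term -> count) for a piece of text."""
    return Counter(tokenize(text))


def cosine(a, b):
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def split_sentences(paragraph):
    """Naive sentence splitter: break after '.', '!' or '?' plus whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", paragraph)
    return [s.strip() for s in parts if s.strip()]


def summarise(document, sentences_per_paragraph=1):
    """Keep, from each paragraph, the sentences most similar to that paragraph.

    Assumption: taking at least one sentence from every paragraph is used
    here as a simple way to keep each topic represented in the extract.
    """
    paragraphs = [p.strip() for p in document.split("\n\n") if p.strip()]
    summary = []
    for paragraph in paragraphs:
        paragraph_vec = tf_vector(paragraph)
        sentences = split_sentences(paragraph)
        # Rank sentence indices by similarity to the enclosing paragraph.
        ranked = sorted(
            range(len(sentences)),
            key=lambda i: cosine(tf_vector(sentences[i]), paragraph_vec),
            reverse=True,
        )
        # Restore original order so the extract reads naturally.
        for i in sorted(ranked[:sentences_per_paragraph]):
            summary.append(sentences[i])
    return " ".join(summary)


if __name__ == "__main__":
    doc = "Cats sleep a lot. Cats and cats chase mice while cats sleep.\n\nDogs bark."
    print(summarise(doc))
```

A fuller treatment would likely add stop-word removal, stemming and a length budget (for example, the 100-word limit used for the DUC-2002 reference summaries), none of which are modelled in this sketch.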