Title: Web document summarisation using pointwise mutual information (PMI) from web resources

Authors: Atul Kumar Srivastava; Dhiraj Pandey; Alok Aggarwal; Sunil Gupta

Addresses: APJ Abdul Kalam Technical University, Uttar Pradesh, Lucknow, 226031, India; JSS Academy of Technical Education Noida, Noida, 201301, India; University of Petroleum and Energy Studies, Dehradun, 248007, India; University of Petroleum and Energy Studies, Dehradun, 248007, India

Abstract: A large amount of data is now generated on the internet, far more than humans can summarise manually. Automatic text summarisation systems are deployed to deal with this challenge. Text summarisation is a field of data mining that highlights the important text in a document. In this paper, we propose a web-based text summarisation approach that generates a good-quality summary based on the total pointwise mutual information (TPMI) scores of the sentences. A sample document from the DUC dataset is pre-processed by tokenisation, stop-word removal and stemming. For the extracted words, the TPMI is estimated by calculating the pointwise mutual information (PMI) of word occurrences on a web search engine. To provide evidence for the robustness of the proposed system, the approach is compared with well-known text summarisation techniques on sentence length and mean score. The results show that our method outperforms the other techniques, achieving the closest mean score and generating good-quality summaries for sentences of different lengths.
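The abstract does not give the exact TPMI formula, so the following Python sketch is only an illustration under stated assumptions: PMI is taken in its standard form, PMI(x, y) = log2(p(x, y) / (p(x) p(y))), with probabilities estimated from search-engine hit counts, and a sentence's TPMI is assumed to be the sum of PMI over all pairs of its distinct pre-processed words. The function names, the hit-count dictionaries and the total page count are hypothetical; no real search engine is queried.

```python
import math
from itertools import combinations


def pmi(hits_x, hits_y, hits_xy, total_pages):
    """Standard PMI from hit counts: log2(p(x, y) / (p(x) * p(y)))."""
    # Assumption: a pair with any zero count contributes nothing,
    # which avoids taking the logarithm of zero.
    if hits_x <= 0 or hits_y <= 0 or hits_xy <= 0:
        return 0.0
    p_x = hits_x / total_pages
    p_y = hits_y / total_pages
    p_xy = hits_xy / total_pages
    return math.log2(p_xy / (p_x * p_y))


def tpmi(words, single_hits, pair_hits, total_pages):
    """Assumed TPMI: sum of PMI over all pairs of distinct words.

    single_hits maps a word to its hit count; pair_hits maps an
    alphabetically sorted (word_a, word_b) tuple to the joint hit count.
    """
    score = 0.0
    for a, b in combinations(sorted(set(words)), 2):
        score += pmi(
            single_hits.get(a, 0),
            single_hits.get(b, 0),
            pair_hits.get((a, b), 0),
            total_pages,
        )
    return score


def rank_sentences(sentences, single_hits, pair_hits, total_pages):
    """Return (score, sentence_index) pairs, highest assumed TPMI first."""
    scored = [
        (tpmi(words, single_hits, pair_hits, total_pages), index)
        for index, words in enumerate(sentences)
    ]
    return sorted(scored, key=lambda item: item[0], reverse=True)
```

Here each sentence is a list of stemmed, stop-word-filtered tokens, and a summary could be formed by taking the top-ranked sentences. How the paper normalises scores or handles sentence length is not stated in the abstract and is left out of this sketch.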

Keywords: document summarisation; text summarisation; PMI; point-wise mutual information.

DOI: 10.1504/IJSSE.2022.127991

International Journal of System of Systems Engineering, 2022 Vol.12 No.4, pp.329 - 353

Received: 18 Aug 2021
Accepted: 25 Oct 2021

Published online: 03 Jan 2023
