Title: R2DCLT: retrieving relevant documents using cosine similarity and LDA in text mining

Authors: R.S. Ramya; Ganesh Singh; Santosh Nimbhorkar Sejal; K.R. Venugopal; S.S. Iyengar; L.M. Patnaik

Addresses: Department of Computer Science and Engineering, University Visvesvaraya College of Engineering, Bangalore, India; Department of Computer Science and Engineering, University Visvesvaraya College of Engineering, Bangalore, India; Department of Computer Science and Engineering, BNM Institute of Technology, Bangalore, India; Bangalore University, Bangalore, India; School of Computing and Information Sciences, Florida International University, USA; National Institute of Advanced Studies, Indian Institute of Science Campus, Bangalore, India

Abstract: The number of digital documents available on the web has increased exponentially, and hence there is a need for effective methods to retrieve and organise them. Since data is globally dispersed and unorganised, a number of algorithms based on relevance calculations have been proposed. However, a gap remains between the user's search intention and the retrieved results. In this paper, we propose a framework for retrieving relevant documents using cosine similarity (CS) and LDA in text mining (R2DCLT). The uniqueness of this approach is that LDA is applied to the documents and to extracted patterns such as unigrams, bigrams and trigrams. Documents are ranked based on the CS score. Experiments are conducted on the Reuters Corpus volume and a custom news dataset. It is observed that R2DCLT outperforms the pattern taxonomy and relevance feature discovery models by providing high-quality relevant documents with improved response time and a dynamically updated document set.
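
The ranking step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it omits the LDA step entirely and only shows how documents might be scored against a query by cosine similarity over unigram, bigram and trigram counts. The function names (ngrams, features, cosine_similarity, rank_documents), the whitespace tokenisation and the sample texts are all assumptions made for illustration.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return the contiguous n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def features(text, max_n=3):
    """Count unigrams, bigrams and trigrams (assumed lowercase, whitespace-split)."""
    tokens = text.lower().split()
    counts = Counter()
    for n in range(1, max_n + 1):
        counts.update(ngrams(tokens, n))
    return counts


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector is treated as dissimilar to everything
    return dot / (norm_a * norm_b)


def rank_documents(query, documents):
    """Return (score, index) pairs, highest cosine similarity first."""
    q = features(query)
    scored = [(cosine_similarity(q, features(d)), i)
              for i, d in enumerate(documents)]
    return sorted(scored, key=lambda s: (-s[0], s[1]))


if __name__ == "__main__":
    # Hypothetical sample documents, not taken from the paper's datasets.
    docs = [
        "stock market prices rise",
        "football match results",
        "market prices fall",
    ]
    for score, idx in rank_documents("market prices", docs):
        print(f"{score:.3f}  {docs[idx]}")
```

In this sketch a shorter document that shares the same n-grams with the query scores higher, because cosine similarity normalises by vector length. A fuller pipeline in the spirit of the abstract would presumably compare topic-level representations produced by LDA rather than raw counts.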

Keywords: pattern mining; query search; query expansion; text feature extraction; text mining.

DOI: 10.1504/IJICT.2021.118576

International Journal of Information and Communication Technology, 2021 Vol.19 No.4, pp.391 - 422

Received: 17 Feb 2020
Accepted: 06 May 2020

Published online: 29 Oct 2021
