Title: Text document categorisation using random forest and C4.5 decision tree classifier

Authors: Sumathi Pawar; Manjula Gururaj Rao; Karuna Pandith

Addresses (all three authors): Information Science and Engineering, NMAMIT, Nitte University, Karkala, Karnataka, 574110, India

Abstract: Documentation is a significant and rapidly developing field, in part because the time available for preparing it is limited. Applications of text classification include language and item identification, document indexing, populating hierarchical catalogues of web resources, and word sense disambiguation. Because a large number of texts serve as documentation, categorisation strategies have been developed to improve efficiency. The proposed system categorises and documents text using the random forest ensemble learning method and the C4.5 decision tree classifier. The system's processes include constructing the decision tree text classifiers, training the constructed models as part of the implementation, dimension reduction, tf/idf indexing of the documents, clustering the terms using Brown clustering, and running the testing dataset through the classifiers to categorise documents. The system is implemented using the Orange tool and Python libraries. The random forest approach is found to increase efficiency, which is attributed to proper construction of the text classifiers.
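The tf/idf indexing step named in the abstract can be illustrated with a minimal sketch. The abstract does not state which weighting variant the authors use, so the formula below (length-normalised term frequency multiplied by an unsmoothed natural-log inverse document frequency), the whitespace tokenisation, the function name `tfidf` and the toy corpus are all assumptions made for illustration only.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per document.

    Assumed weighting (not taken from the paper):
    weight = (count / document length) * ln(N / document frequency).
    """
    # Assumed tokenisation: lowercase and split on whitespace.
    tokenised = [doc.lower().split() for doc in docs]
    n_docs = len(tokenised)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenised for term in set(tokens))
    weights = []
    for tokens in tokenised:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

# Hypothetical toy corpus, not data from the paper.
docs = [
    "random forest builds many decision trees",
    "decision trees split on informative terms",
    "brown clustering groups similar terms",
]
weights = tfidf(docs)
```

In this sketch a term that occurs in only one document, such as "random", receives a higher weight than a term shared across documents, such as "decision"; such weights would then serve as the feature values passed to a classifier.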

Keywords: dimensionality reduction; KE approach; indexing; machine learning; tf/idf; ensemble learning.

DOI: 10.1504/IJCSYSE.2023.132924

International Journal of Computational Systems Engineering, 2023 Vol.7 No.2/3/4, pp.211 - 220

Received: 01 Aug 2022
Accepted: 18 Apr 2023

Published online: 16 Aug 2023
