Authors: K. Sridharan; P. Sivakumar
Addresses: Department of Information Technology, Panimalar Engineering College, Chennai, Tamil Nadu, India; Department of Computer Science and Engineering, KSR College of Engineering, Tiruchengode, Namakkal, Tamil Nadu, India
Abstract: Internet use is growing rapidly. Users can share and use large amounts of structured, unstructured and semi-structured data on the internet, such as videos, images, audio and text. Text mining proceeds in the following main stages: pre-processing, feature dimension reduction (feature selection or feature extraction), and text classification or clustering on the final features. In this paper, the pre-processing step uses a context-sensitive stemmer, removing stop words and stripping different suffixes to reduce the word count. Unsupervised and supervised feature selection methods, namely document frequency, term strength, chi-square and information gain, are compared to identify the best method for web document feature selection. Classification techniques such as latent semantic analysis, genetic algorithm, Rocchio's algorithm and neural networks are also compared through systematic reviews.
Keywords: information gain; IG; document frequency; DF; term strength; TS; artificial neural network; latent semantic analysis; LSA; text mining; stemming.
International Journal of Business Information Systems, 2018 Vol.28 No.4, pp.504-518
Received: 17 Nov 2016
Accepted: 27 Dec 2016
Published online: 31 Jul 2018