Title: A systematic review on techniques of feature selection and classification for text mining

Authors: K. Sridharan; P. Sivakumar

Addresses: Department of Information Technology, Panimalar Engineering College, Chennai, Tamil Nadu, India; Department of Computer Science and Engineering, KSR College of Engineering, Tiruchengode, Namakkal, Tamil Nadu, India

Abstract: Internet use is growing rapidly. Large volumes of structured, unstructured and semi-structured data, such as videos, images, audio and text, can be shared and used online. Text mining typically proceeds in three stages: pre-processing, feature dimension reduction (feature selection or feature extraction), and text classification or clustering on the final features. In this paper, the pre-processing step uses a context-sensitive stemmer to remove stop words and strip different suffixes, thereby reducing the word count. Unsupervised and supervised feature selection methods, such as document frequency, term strength, chi-square and information gain, are compared to identify the best method for web document feature selection. Classification techniques such as latent semantic analysis, genetic algorithms, Rocchio's algorithm and neural networks are also compared through a systematic review.
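
As an illustration of two of the feature selection measures named in the abstract, the following is a minimal sketch of document frequency and information gain on a toy corpus. It is not the authors' implementation: the function names, the four-document corpus and the "spam"/"ham" labels are all hypothetical, and information gain is computed here in its common form (class entropy minus the entropy conditioned on a term's presence or absence).

```python
import math
from collections import Counter


def document_frequency(docs):
    """Count, for each term, how many documents contain it at least once."""
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # set() so a term counts once per document
    return df


def entropy(label_list):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(label_list)
    if n == 0:
        return 0.0
    counts = Counter(label_list)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def information_gain(docs, labels, term):
    """Reduction in class entropy from knowing whether `term` occurs."""
    with_term = [y for tokens, y in zip(docs, labels) if term in tokens]
    without_term = [y for tokens, y in zip(docs, labels) if term not in tokens]
    n = len(labels)
    conditional = (len(with_term) / n) * entropy(with_term) \
        + (len(without_term) / n) * entropy(without_term)
    return entropy(labels) - conditional


# Hypothetical toy corpus: already tokenised, stop words removed.
docs = [
    ["cheap", "pills", "offer"],
    ["meeting", "agenda", "offer"],
    ["cheap", "offer", "now"],
    ["project", "meeting", "notes"],
]
labels = ["spam", "ham", "spam", "ham"]

df = document_frequency(docs)
ig_cheap = information_gain(docs, labels, "cheap")
ig_offer = information_gain(docs, labels, "offer")
```

In this toy corpus "offer" has the highest document frequency but a low information gain, while "cheap" separates the two classes perfectly. This shows why an unsupervised measure such as document frequency and a supervised one such as information gain can rank the same terms differently.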

Keywords: information gain; IG; document frequency; DF; term strength; TS; artificial neural network; latent semantic analysis; LSA; text mining; stemming.

DOI: 10.1504/IJBIS.2018.093659

International Journal of Business Information Systems, 2018 Vol.28 No.4, pp.504 - 518

Received: 17 Nov 2016
Accepted: 27 Dec 2016

Published online: 31 Jul 2018
