Title: Correlation-based feature subset selection technique for web spam classification

Authors: Surender Singh; Ashutosh Kumar Singh

Addresses: Department of Computer Applications, National Institute of Technology, Kurukshetra, Haryana-136119, India; Department of Information Technology, Maharaja Surajmal Institute of Technology, New Delhi-110058, India ' Department of Computer Applications, National Institute of Technology, Kurukshetra, Haryana-136119, India

Abstract: In recent years, many machine learning algorithms and web spam features have been developed to detect web spam. The progress of machine learning (ML) depends largely on the features being used. If the features correlate with each other, it is easy for an ML algorithm to learn from them; if the features are very complex, the algorithm may not be able to learn. Feature choice is therefore one of the most important and fundamental areas of work in machine learning applications. In this paper, the correlation-based feature selection (CFS) technique with best-first search is used to select the most effective features. Two datasets (WebSpam-UK2006 and WebSpam-UK2007) and four classifiers (Naïve Bayes, J48, random forest and AdaBoost) are used in the experiments. The results show a significant improvement in AUC (area under the receiver operating characteristic curve) for Naïve Bayes and J48.
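
To make the CFS idea in the abstract concrete, the following is a minimal sketch of a correlation-based merit score combined with a forward best-first search. It is not the authors' implementation. It rests on several assumptions that do not come from the paper: absolute Pearson correlation stands in for the correlation measure (CFS as commonly implemented uses symmetrical uncertainty on discretised features), the search stops after five consecutive non-improving expansions, the names `pearson`, `cfs_merit` and `cfs_best_first` are hypothetical, and the toy data is invented for illustration and is unrelated to the WebSpam-UK datasets.

```python
import heapq
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def cfs_merit(subset, r_cf, r_ff):
    """Merit = k * mean(r_cf) / sqrt(k + k*(k-1) * mean(r_ff)) for a subset of size k."""
    k = len(subset)
    if k == 0:
        return 0.0
    mean_cf = sum(r_cf[i] for i in subset) / k
    if k == 1:
        return mean_cf
    pairs = [(i, j) for i in subset for j in subset if i < j]
    mean_ff = sum(r_ff[i][j] for i, j in pairs) / len(pairs)
    return k * mean_cf / math.sqrt(k + k * (k - 1) * mean_ff)


def cfs_best_first(columns, target, max_stale=5):
    """Forward best-first search over feature subsets, scored by cfs_merit.

    Assumption: stop after `max_stale` consecutive expansions that do not
    improve the best merit seen so far.
    """
    n = len(columns)
    # Assumption: absolute Pearson correlation as the correlation measure.
    r_cf = [abs(pearson(c, target)) for c in columns]
    r_ff = [[abs(pearson(a, b)) for b in columns] for a in columns]

    best_subset, best_merit = (), 0.0
    open_heap = [(0.0, ())]  # entries are (-merit, sorted tuple of feature indices)
    visited = {()}
    stale = 0
    while open_heap and stale < max_stale:
        _, subset = heapq.heappop(open_heap)  # most promising open subset
        improved = False
        for f in range(n):
            if f in subset:
                continue
            child = tuple(sorted(subset + (f,)))
            if child in visited:
                continue
            visited.add(child)
            merit = cfs_merit(child, r_cf, r_ff)
            heapq.heappush(open_heap, (-merit, child))
            if merit > best_merit + 1e-12:  # small tolerance so ties do not count
                best_subset, best_merit = child, merit
                improved = True
        stale = 0 if improved else stale + 1
    return list(best_subset), best_merit


# Invented toy data: column 1 duplicates column 0 (scaled), column 2 is
# uncorrelated with the label, column 3 is predictive but less redundant.
label = [0, 0, 0, 0, 1, 1, 1, 1]
features = [
    [1, 2, 1, 2, 3, 4, 3, 4],
    [2, 4, 2, 4, 6, 8, 6, 8],
    [0, 1, 0, 1, 0, 1, 0, 1],
    [5, 5, 6, 6, 7, 7, 8, 8],
]
selected, merit = cfs_best_first(features, label)
print(selected, round(merit, 4))
```

The merit rewards subsets whose features correlate with the class label and penalises subsets whose features correlate with one another, so on this toy data the search keeps only one of the two duplicated columns and drops the uninformative one.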

Keywords: web spam; machine learning; filter approach; correlation-based feature selection; CFS; best-first search; BFS.

DOI: 10.1504/IJWET.2018.097562

International Journal of Web Engineering and Technology, 2018 Vol.13 No.4, pp.363 - 379

Published online: 28 Jan 2019
