Authors: Rajendra Kumar Roul
Addresses: Department of Computer Science and Information System, BITS, Pilani – K.K. Birla Goa Campus, Goa-403726, India
Abstract: Web spamming pushes unimportant pages higher in search results than they deserve. Such spam pages mislead the search engine and keep it from returning high-quality information, so detecting and eliminating them is a pressing need. To this end, this study focuses on two aspects of machine learning. First, it proposes a new content-based spam detection technique that identifies nine important features which help to determine whether a page is spam or non-spam. Each feature has an associated value, calculated by parsing the documents and then performing the steps required to compute its score. These nine features, together with the class label (spam or non-spam), form a feature vector used to train classifiers to detect spam pages. Second, it highlights the importance of deep learning, using the multilayer extreme learning machine (ELM), in the field of spam page detection. For the experimental work, two benchmark datasets (WEBSPAM-UK2002 and WEBSPAM-UK2006) were used, and the results obtained with multilayer ELM were found to be more promising than those of other established classifiers.
Keywords: content-based; deep learning; extreme learning machine; multilayer ELM; support vector machine; spam page.
International Journal of Big Data Intelligence, 2018 Vol.5 No.1/2, pp.49 - 61
Received: 15 Apr 2016
Accepted: 25 Nov 2016
Published online: 29 Sep 2017