Title: Improving blog spam filters via machine learning

Authors: Weiwen Yang; Linchi Kwok

Addresses: SEAS, Columbia University, New York, NY 10027, USA; The Collins College of Hospitality Management, California State Polytechnic University Pomona, 3801 W. Temple Ave., Pomona, CA 91768, USA

Abstract: As an important platform for electronic commerce, blogs can greatly influence internet users' purchasing decisions. Spam, however, can substantially reduce blogs' positive impact on electronic commerce. This paper introduces SK, an alternative algorithm that combines supervised learning (SVM) and unsupervised learning (K-means++) to detect blog spam. If either method classifies a blog as spam, the blog is assigned to the spam category. Feature selection includes term frequency, inverse document frequency, binary representation, stop words, outgoing links, advertiser content, and burst with keywords. The accuracy of each model was tested and compared in experiments with 3,000 blog pages from the University of Maryland and 3,560 internet blogs. Findings suggest that combining the SVM algorithm and K-means++ clustering can increase spam-filtering accuracy by about 7% compared with using only one of these methods. Strengths and weaknesses of various spam-filtering methods are discussed, providing considerations for businesses when choosing a spam filter.
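
The decision rule described in the abstract (a blog is spam if either model flags it) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the names `sk_combine`, `classify_blogs`, `toy_svm`, and `toy_kmeans` are hypothetical, and the two toy detectors are simple keyword and link-count stand-ins for a trained SVM and a K-means++ clustering step, used only so the example runs on its own.

```python
from typing import Callable, List, Sequence

# Hypothetical type: a detector maps a blog's text to True (spam) or False (not spam).
Detector = Callable[[str], bool]


def sk_combine(svm_says_spam: bool, kmeans_says_spam: bool) -> bool:
    """OR rule from the abstract: spam if either model flags the blog."""
    return svm_says_spam or kmeans_says_spam


def classify_blogs(
    blogs: Sequence[str], svm_detector: Detector, kmeans_detector: Detector
) -> List[bool]:
    """Apply both detectors to every blog and combine their votes."""
    return [sk_combine(svm_detector(b), kmeans_detector(b)) for b in blogs]


# Toy stand-ins (assumptions for illustration, not trained models).
def toy_svm(text: str) -> bool:
    # Pretend the supervised model reacts to advertiser-style wording.
    return "buy now" in text.lower()


def toy_kmeans(text: str) -> bool:
    # Pretend the spam cluster is characterized by many outgoing links.
    return text.lower().count("http") >= 3


blogs = [
    "Notes from my trip to Kyoto last spring.",
    "Buy now and save big on watches!",
    "Great deals http://a.example http://b.example http://c.example",
]
flags = classify_blogs(blogs, toy_svm, toy_kmeans)
print(flags)
```

Because the votes are combined with a logical OR, the combined filter flags every blog that either detector flags. This can raise the number of spam blogs caught, but it also passes along the false positives of both detectors.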

Keywords: spam filter; support vector machine; SVM; K-means++; machine learning; neural network.

DOI: 10.1504/IJDATS.2017.085901

International Journal of Data Analysis Techniques and Strategies, 2017 Vol.9 No.2, pp.99 - 121

Received: 24 Jun 2015
Accepted: 07 Jan 2016

Published online: 18 Aug 2017
