Title: Spam filtering using Kolmogorov complexity analysis

Authors: G. Richard, A. Doncescu

Addresses: IRIT, University of Toulouse, France. ' LAAS CNRS, University of Toulouse, France

Abstract: One of the most irrelevant side effects of e-commerce technology is the development of spamming as an e-marketing technique. Spam e-mails (or unsolicited commercial e-mails) induce a burden for everybody having an electronic mailbox: detecting and filtering spam is then a challenging task and a lot of approaches have been developed to identify spam before it is posted in the end user|s mailbox. In this paper, we focus on a relatively new approach whose foundations rely on the works of A. Kolmogorov. The main idea is to give a formal meaning to the notion of |information content| and to provide a measure of this content. Using such a quantitative approach, it becomes possible to define a distance, which is a major tool for classification purposes. To validate our approach, we proceed in two steps: first, we use the classical compression distance over a mix of spam and legitimate e-mails to check out if they can be properly clustered without any supervision. It has been the case to highlight a kind of underlying structure for spam e-mails. In the second step, we have implemented a k-nearest neighbours algorithm providing 85% as accuracy rate. Coupled with other anti-spam techniques, compression-based methods could bring a great help in the spam filtering challenge.

Keywords: spam filtering; Kolmogorov complexity; compression; clustering; k-nearest neighbours; spamming; e-marketing; unsolicited e-mails; information content; anti-spam techniques.

DOI: 10.1504/IJWGS.2008.018500

International Journal of Web and Grid Services, 2008 Vol.4 No.1, pp.136 - 148

Published online: 25 May 2008 *

Full-text access for editors Full-text access for subscribers Purchase this article Comment on this article