Title: Reversing the effects of tokenisation attacks against content-based spam filters

Authors: Igor Santos; Carlos Laorden; Borja Sanz; Pablo G. Bringas

Addresses: S³Lab, DeustoTech - Computing, Deusto Institute of Technology, University of Deusto, Avenida de las Universidades 24, 48007, Bilbao, Spain (all authors)

Abstract: More than 85% of received emails are spam. Many current solutions rely on machine-learning algorithms trained on statistical representations of the terms that most commonly appear in such emails. However, some attacks can subvert the filtering capabilities of these methods. Tokenisation attacks, for example, insert characters within words so that the original terms are no longer recognised. In this paper, we introduce a new method that reverses the effects of tokenisation attacks. Our method processes emails iteratively: starting from the first token, it considers possible words and compares these candidates with a common dictionary to which spam words have previously been added. We provide an empirical study of how tokenisation attacks affect the filtering capability of a Bayesian classifier, and we show that our method can reverse the effects of these attacks.
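
The idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' algorithm, only one plausible reading of "considering possible words, starting from the first token": it greedily joins consecutive fragments and keeps the longest candidate found in a dictionary. The function name `detokenise`, the toy `DICTIONARY`, and the `MAX_SPAN` limit are all hypothetical and introduced here for illustration.

```python
import re

# Hypothetical toy dictionary: common words plus previously added spam words.
DICTIONARY = {"buy", "cheap", "viagra", "now", "free", "offer"}

# Assumed upper bound on how many fragments a single word may be split into.
MAX_SPAN = 8


def detokenise(text, dictionary=DICTIONARY, max_span=MAX_SPAN):
    """Greedily rejoin alphabetic fragments that together form a dictionary word."""
    # Split on anything that is not a letter, so inserted dots, dashes
    # and spaces all act as fragment separators.
    tokens = re.findall(r"[A-Za-z]+", text)
    result = []
    i = 0
    while i < len(tokens):
        # Starting at token i, try the longest candidate first.
        for span in range(min(max_span, len(tokens) - i), 0, -1):
            candidate = "".join(tokens[i:i + span]).lower()
            if candidate in dictionary:
                result.append(candidate)
                i += span
                break
        else:
            # No candidate matched the dictionary: keep the token as it is.
            result.append(tokens[i].lower())
            i += 1
    return result


print(detokenise("B.u.y ch-e-ap v i a g r a now"))
```

Trying the longest candidate first is a design choice of this sketch; it avoids stopping early at a short dictionary word that happens to be a prefix of a longer one. The recovered word list could then be passed to a content-based filter in place of the attacked text.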

Keywords: security; tokenisation attacks; content-based spam filters; spam emails; filtering; Bayesian classifier.

DOI: 10.1504/IJSN.2013.055944

International Journal of Security and Networks, 2013, Vol. 8, No. 2, pp. 106-116

Available online: 16 Aug 2013
