Reversing the effects of tokenisation attacks against content-based spam filters
by Igor Santos; Carlos Laorden; Borja Sanz; Pablo G. Bringas
International Journal of Security and Networks (IJSN), Vol. 8, No. 2, 2013

Abstract: More than 85% of received emails are spam. Many current solutions rely on machine-learning algorithms trained on statistical representations of the terms that most commonly appear in such emails. However, some attacks can subvert the filtering capabilities of these methods. Tokenisation attacks do so by inserting characters within words. In this paper, we introduce a new method that reverses the effects of tokenisation attacks. Our method processes emails iteratively, considering possible words starting from the first token, and compares the word candidates with a common dictionary to which spam words have previously been added. We provide an empirical study of how tokenisation attacks affect the filtering capability of a Bayesian classifier, and we show that our method can reverse the effects of these attacks.
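
The dictionary-based idea described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not give the algorithm's details, so the greedy longest-match strategy, the toy `DICTIONARY`, the `MAX_SPAN` limit, and the function names `tokenise` and `reverse_tokenisation` are all assumptions made here for illustration. The sketch splits a message into tokens, then, starting from the first token, joins consecutive tokens into word candidates and keeps a candidate when it appears in the dictionary.

```python
import re

# Hypothetical toy dictionary: a few common words plus spam terms.
# A real dictionary would be far larger (assumption for illustration).
DICTIONARY = {"buy", "cheap", "viagra", "now", "free", "offer"}

# Assumed upper bound on how many tokens a single word may be split into.
MAX_SPAN = 8


def tokenise(text):
    """Lower-case the text and split on any run of non-letter characters."""
    return [t for t in re.split(r"[^A-Za-z]+", text.lower()) if t]


def reverse_tokenisation(text, dictionary=DICTIONARY, max_span=MAX_SPAN):
    """Rejoin tokens that together form a dictionary word.

    Starting from the first token, the longest run of consecutive tokens
    whose concatenation is in the dictionary is merged into one word.
    Tokens that match nothing are passed through unchanged.
    """
    tokens = tokenise(text)
    words = []
    i = 0
    while i < len(tokens):
        merged = None
        # Try the longest candidate first, then shrink the window.
        for span in range(min(max_span, len(tokens) - i), 0, -1):
            candidate = "".join(tokens[i:i + span])
            if candidate in dictionary:
                merged = (candidate, span)
                break
        if merged is not None:
            words.append(merged[0])
            i += merged[1]
        else:
            words.append(tokens[i])
            i += 1
    return words


if __name__ == "__main__":
    print(reverse_tokenisation("b.u.y ch-eap v i a g r a now"))
```

Trying the longest candidate first is one possible design choice; it favours longer dictionary words over short ones that happen to be prefixes. The recovered words could then be passed to a content-based filter such as a Bayesian classifier in place of the obfuscated tokens.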

Online publication date: Sun, 18-Aug-2013
