Title: A novel SMS spam dataset and bi-directional transformer based short-text representations for SMS spam detection
Authors: Srishti Maheshwari; Shubhangi Aggarwal; Rishabh Kaushal
Addresses: Department of Information Technology, Indira Gandhi Delhi Technical University for Women, India; Department of Information Technology, Indira Gandhi Delhi Technical University for Women, India; Department of Information Technology, Indira Gandhi Delhi Technical University for Women, India
Abstract: Short message service (SMS) is a means of exchanging short messages between mobile phones without the internet. Unfortunately, the popularity of SMS is exploited to send irrelevant and malicious messages that entrap users in scams and frauds. In this work, we investigate the performance of state-of-the-art bi-directional encoder representations from transformers (BERT) for short-text messages in SMS data. For evaluation, we curate a novel augmented SMS spam dataset by extending a classical SMS spam dataset to further categorise spam messages into four fine-grained categories: indecent, malicious, promotional, and updates. We perform experiments on the standard benchmark SMS dataset of spam and non-spam messages and on our curated multi-class SMS spam dataset. We find that BERT-based short-text representations outperform the baseline traditional approach of handcrafted text-based features by 15%-30% in accuracy across different machine learning algorithms on the multi-class SMS spam dataset.
Keywords: spam classification; machine learning; word embedding; representation learning; short message service; SMS.
DOI: 10.1504/IJIDS.2024.142636
International Journal of Information and Decision Sciences, 2024 Vol.16 No.4, pp.341-359
Received: 17 Nov 2021
Accepted: 05 Mar 2022
Published online: 14 Nov 2024