
Title: A novel SMS spam dataset and bi-directional transformer based short-text representations for SMS spam detection

Authors: Srishti Maheshwari; Shubhangi Aggarwal; Rishabh Kaushal

Addresses: Department of Information Technology, Indira Gandhi Delhi Technical University for Women, India; Department of Information Technology, Indira Gandhi Delhi Technical University for Women, India; Department of Information Technology, Indira Gandhi Delhi Technical University for Women, India

Abstract: Short message service (SMS) is a means of exchanging short messages between mobile phones without an internet connection. Unfortunately, the popularity of SMS is exploited to send irrelevant and malicious messages that entrap users in scams and fraud. In this work, we investigate the performance of state-of-the-art bi-directional encoder representations from transformers (BERT) for short-text messages in SMS data. For evaluation, we curate a novel augmented SMS spam dataset by extending a classical SMS spam dataset to further categorise spam SMS messages into four fine-grained categories, namely, indecent, malicious, promotional, and updates. We perform experiments on the standard benchmark SMS dataset of spam and non-spam messages and on our curated multi-class SMS spam dataset. We find that BERT-based short-text representations outperform the traditional baseline approach of using handcrafted text-based features by 15%-30% in accuracy across different machine learning algorithms on the multi-class SMS spam dataset.

Keywords: spam classification; machine learning; word embedding; representation learning; short message service; SMS.

DOI: 10.1504/IJIDS.2024.142636

International Journal of Information and Decision Sciences, 2024 Vol.16 No.4, pp.341 - 359

Received: 17 Nov 2021
Accepted: 05 Mar 2022
Published online: 14 Nov 2024
