Title: Phish webpage classification using hybrid algorithm of machine learning and statistical induction ratios

Authors: Hiba Zuhair; Ali Selamat

Addresses: Department of Systems Engineering, College of Information Engineering, Al-Nahrain University, Baghdad, Iraq ' Faculty of Engineering, School of Computing, UTM and Media and Games Center of Excellence (MagicX), Universiti Teknologi Malaysia (UTM), Johor, Malaysia; Malaysia Japan International Institute of Technology (MJIIT), Universiti Teknologi Malaysia, Jalan Sultan Yahya Petra, Kuala Lumpur, Malaysia; Center for Basic and Applied Research, Faculty of Informatics and Management, University of Hradec Kralove, Rokitanskeho 62, 500 03 Hradec Kralove, Czech Republic

Abstract: Although conventional machine learning-based anti-phishing techniques outperform their competitors in phishing detection, they remain vulnerable to zero-hour phish webpages because of the constraints on their phishing induction. Phishing induction must therefore be boosted through the extraction of new features, the selection of robust subsets of decisive features, and the active learning of classifiers on a big webpage stream. In this paper, we propose a hybrid feature-based classification algorithm (HFBC) for decisive phish webpage classification. HFBC hybridises two statistical criteria, optimised feature occurrence (OFC) and phishing induction ratio (PIR), with the induction settings of the most salient machine learning algorithms, Naïve Bayes and decision tree. Additionally, we propose two constituent algorithms, one for feature extraction and one for feature selection, for holistic phish webpage characterisation. The superiority of the proposed approach is justified through chronological, real-time, and comparative analyses against existing machine learning-based anti-phishing techniques.

Keywords: phish webpage; machine learning; optimised feature occurrence; OFC; phishing induction ratio; PIR; hybrid feature-based classifier; HFBC.

DOI: 10.1504/IJDMMM.2020.108727

International Journal of Data Mining, Modelling and Management, 2020 Vol.12 No.3, pp.255 - 276

Received: 10 Oct 2018
Accepted: 31 May 2019

Published online: 23 Jul 2020
