Title: A cluster and label approach for classifying imbalanced data streams in the presence of scarcely labelled data

Authors: Kiran Bhowmick; Meera Narvekar

Addresses: Department of Computer Engineering, D J Sanghvi College of Engineering, Mumbai, 400056, India; Department of Computer Engineering, D J Sanghvi College of Engineering, Mumbai, 400056, India

Abstract: Classifying data streams is challenging, primarily because data arrive continuously and without bound and because class labels are often unavailable. The problem is compounded when the stream is also imbalanced. Given these characteristics, it is not feasible to store the whole stream and process it to deal with the imbalance. A solution is therefore needed that classifies imbalanced data streams while accounting for the unavailability of class labels. This paper proposes a semi-supervised learning (SSL)-based model to classify scarcely labelled imbalanced data streams. The model uses a modified cluster and label SSL approach, with expectation maximisation for clustering and similarity-based label propagation for labelling the unlabelled clusters. It also employs a novel imbalance-sensitive cluster merge technique to deal with the imbalanced data. The results show that the model outperforms standard stream classification algorithms.
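
As a rough illustration of the cluster and label idea summarised in the abstract, the sketch below fits a small Gaussian mixture with expectation maximisation, labels each cluster by majority vote among its few labelled points, and gives any cluster with no labelled points the label of its most similar labelled cluster. It is not the authors' model: the one-dimensional data, the function names em_1d and cluster_and_label, the quantile-based initialisation, and the use of distance between cluster means as the similarity measure are all assumptions made for illustration. The paper's imbalance-sensitive cluster merge step and its handling of a continuous stream are not attempted here.

```python
import math
from collections import Counter


def em_1d(xs, k, iters=30):
    """Fit a k-component 1-D Gaussian mixture by EM; return (means, hard cluster ids)."""
    ordered = sorted(xs)
    n = len(xs)
    # Assumed deterministic initialisation: spread the means over the sorted data.
    means = [ordered[int((j + 0.5) * n / k)] for j in range(k)]
    variances = [1.0] * k
    weights = [1.0 / k] * k
    resp = []
    for _ in range(iters):
        # E-step: responsibility of each component for each point.
        resp = []
        for x in xs:
            dens = [
                w * math.exp(-((x - m) ** 2) / (2 * v)) / math.sqrt(2 * math.pi * v)
                for w, m, v in zip(weights, means, variances)
            ]
            total = sum(dens) or 1e-300  # guard against underflow to zero
            resp.append([d / total for d in dens])
        # M-step: re-estimate weights, means and variances.
        for j in range(k):
            nj = sum(r[j] for r in resp) + 1e-12
            means[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, xs)) / nj
            variances[j] = max(var, 1e-6)  # floor to avoid a collapsed component
            weights[j] = nj / n
    clusters = [max(range(k), key=lambda j: r[j]) for r in resp]
    return means, clusters


def cluster_and_label(xs, labels, k):
    """Label every point via its cluster; `labels` holds None for unlabelled points.

    Assumes at least one point carries a label.
    """
    means, clusters = em_1d(xs, k)
    # Majority vote among the labelled members of each cluster.
    votes = {j: Counter() for j in range(k)}
    for c, y in zip(clusters, labels):
        if y is not None:
            votes[c][y] += 1
    cluster_label = {j: v.most_common(1)[0][0] for j, v in votes.items() if v}
    # Unlabelled clusters take the label of the closest-mean labelled cluster
    # (a stand-in for the paper's similarity-based propagation).
    labelled = list(cluster_label)
    for j in range(k):
        if j not in cluster_label:
            nearest = min(labelled, key=lambda i: abs(means[i] - means[j]))
            cluster_label[j] = cluster_label[nearest]
    return [cluster_label[c] for c in clusters]


# Toy imbalanced data: two majority groups (20 points) and one minority group (4 points).
xs = (
    [-0.5, -0.4, -0.3, -0.2, -0.1, 0.1, 0.2, 0.3, 0.4, 0.5]
    + [5.5, 5.6, 5.7, 5.8, 5.9, 6.1, 6.2, 6.3, 6.4, 6.5]
    + [19.8, 19.9, 20.1, 20.2]
)
labels = [None] * len(xs)
labels[0] = "majority"   # only two points carry labels
labels[21] = "minority"
predicted = cluster_and_label(xs, labels, k=3)
```

In this toy run the middle group contains no labelled points, so it inherits its label from the nearest labelled cluster. A streaming version would have to process the data in chunks and summarise them (for example with the micro clusters named in the keywords) rather than hold all points in memory.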

Keywords: data streams; classification; imbalanced data; semi-supervised learning; scarcely labelled; cluster and label; micro cluster; label propagation.

DOI: 10.1504/IJBIDM.2022.126503

International Journal of Business Intelligence and Data Mining, 2022 Vol.21 No.4, pp.443 - 464

Received: 14 Apr 2021
Accepted: 26 Jun 2021

Published online: 27 Oct 2022
