Title: A semi-supervised clustering-based classification model for classifying imbalanced data streams in the presence of scarcely labelled data

Authors: Kiran Bhowmick; Meera Narvekar

Addresses: Department of Computer Engineering, D.J. Sanghvi College of Engineering, Mumbai, 400056, India; Department of Computer Engineering, D.J. Sanghvi College of Engineering, Mumbai, 400056, India

Abstract: Data streams are potentially infinite in length, fast changing and scarcely labelled. It is practically impossible to label all the observed instances. Online frameworks for classifying data streams are generally supervised in nature: they assume the availability of labelled data and hence cannot be applied directly to scarcely labelled streams. Semi-supervised learning (SSL) addresses this problem of scarcely labelled data by using a large amount of unlabelled data together with labelled data to build classifiers. Data streams may also suffer from the problem of imbalanced data. Previous works on learning from data streams have analysed the problem of imbalanced data. However, to the best of our knowledge, no work so far has applied semi-supervised learning approaches to classifying imbalanced data streams. This paper proposes a model that uses a semi-supervised clustering technique to classify an imbalanced data stream in the presence of scarcely labelled data. The results show that the model outperforms many state-of-the-art techniques.

Keywords: data streams; imbalanced data; semi-supervised clustering; expectation maximisation; partially labelled.

DOI: 10.1504/IJBIDM.2022.120827

International Journal of Business Intelligence and Data Mining, 2022 Vol.20 No.2, pp.170 - 191

Received: 20 Jan 2020
Accepted: 12 May 2020

Published online: 11 Feb 2022
