
Title: Ensemble of large self-supervised transformers for improving speech emotion recognition

Authors: Mrunal Prakash Gavali; Abhishek Verma

Addresses: Department of Computer Science, California State University, Northridge, CA, USA; Department of Computer Science, California State University, Northridge, CA, USA

Abstract: Speech emotion recognition (SER) is a challenging and active field in collaborative, social robotics, where it improves human-robot interaction (HRI), and in affective computing, where it serves as a feedback mechanism. More recently, self-supervised learning (SSL) approaches have become an important method for learning speech representations. We present results of experiments on the challenging large-scale speech emotion RAVDESS dataset. Six very large state-of-the-art self-supervised learning transformer models were trained on the speech emotion dataset. Wav2Vec2.0-XLSR-53 was the most successful of the six level-0 models and achieved a classification accuracy of 93%. We propose majority voting ensemble models that combine three and five level-0 models. The five-model and three-model majority voting ensemble models achieved 96.88% and 96.53% accuracy, respectively, and thereby significantly outperformed the best level-0 model and surpassed the state of the art.
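
The majority voting step mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `majority_vote`, the toy labels, and the tie-breaking rule (the earliest-listed model's vote wins among tied labels, assuming models are listed from strongest to weakest) are all assumptions made for illustration, since the abstract does not specify them.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine per-model label predictions by hard majority voting.

    predictions: list of lists, one inner list of predicted labels per
    level-0 model, all of the same length (one label per utterance).
    Tie-break (an assumption, not from the paper): among labels tied for
    the most votes, pick the one voted by the earliest-listed model.
    """
    if not predictions:
        raise ValueError("at least one model's predictions are required")
    n_samples = len(predictions[0])
    if any(len(p) != n_samples for p in predictions):
        raise ValueError("all models must predict the same number of samples")

    voted = []
    for labels in zip(*predictions):  # labels for one utterance, in model order
        counts = Counter(labels)
        top = max(counts.values())
        # First label, in model order, that reaches the top vote count.
        winner = next(label for label in labels if counts[label] == top)
        voted.append(winner)
    return voted


# Hypothetical predictions from three level-0 models on four utterances.
model_a = ["happy", "sad", "angry", "calm"]
model_b = ["happy", "angry", "angry", "sad"]
model_c = ["sad", "sad", "fearful", "neutral"]

print(majority_vote([model_a, model_b, model_c]))
```

With an odd number of models (three or five, as in the abstract), ties can still occur when more than two emotion classes receive votes, which is why the sketch needs an explicit tie-breaking rule.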

Keywords: speech emotion recognition; SER; self-supervised learning; SSL; emotion AI; transformers; speech processing; acoustic features.

DOI: 10.1504/IJDMMM.2025.146585

International Journal of Data Mining, Modelling and Management, 2025 Vol.17 No.2, pp.217 - 244

Received: 19 Jan 2024
Accepted: 22 Apr 2024
Published online: 05 Jun 2025


