Title: Ensemble of large self-supervised transformers for improving speech emotion recognition
Authors: Mrunal Prakash Gavali; Abhishek Verma
Addresses: Department of Computer Science, California State University, Northridge, CA, USA; Department of Computer Science, California State University, Northridge, CA, USA
Abstract: Speech emotion recognition (SER) is a challenging and active research area within collaborative and social robotics, where it is used to improve human-robot interaction (HRI), and within affective computing, where it serves as a feedback mechanism. Recently, self-supervised learning (SSL) approaches have become an important method for learning speech representations. We present results of experiments on the challenging, large-scale RAVDESS speech emotion dataset. Six very large state-of-the-art self-supervised transformer models were trained on this dataset. Wav2Vec2.0-XLSR-53 was the most successful of the six level-0 models, achieving a classification accuracy of 93%. We propose majority voting ensemble models that combine three and five level-0 models. The five-model and three-model majority voting ensembles achieved 96.88% and 96.53% accuracy, respectively, thereby significantly outperforming the best level-0 model and surpassing the state of the art.
Keywords: speech emotion recognition; SER; self-supervised learning; SSL; emotion AI; transformers; speech processing; acoustic features.
DOI: 10.1504/IJDMMM.2025.146585
International Journal of Data Mining, Modelling and Management, 2025 Vol.17 No.2, pp.217 - 244
Received: 19 Jan 2024
Accepted: 22 Apr 2024
Published online: 05 Jun 2025
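
The majority voting step named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `majority_vote`, the example labels, and in particular the tie-breaking rule (fall back to the earliest-listed model whose vote is among the tied labels) are assumptions made for illustration, since the abstract does not describe how ties are resolved.

```python
from collections import Counter
from typing import List, Sequence


def majority_vote(predictions: Sequence[Sequence[str]]) -> List[str]:
    """Combine per-model label predictions by hard majority voting.

    predictions[m][i] is the label that level-0 model m assigns to sample i.
    Tie-breaking is an assumption of this sketch: when several labels share
    the top count, the vote of the earliest-listed model among them wins.
    """
    if not predictions:
        raise ValueError("at least one model's predictions are required")
    n_samples = len(predictions[0])
    if any(len(p) != n_samples for p in predictions):
        raise ValueError("all models must predict the same number of samples")

    ensemble: List[str] = []
    for votes in zip(*predictions):  # votes = one label per model for a sample
        counts = Counter(votes)
        top = max(counts.values())
        # First vote (in model order) that reaches the top count.
        winner = next(label for label in votes if counts[label] == top)
        ensemble.append(winner)
    return ensemble


# Hypothetical outputs of three level-0 models on four utterances.
model_a = ["happy", "sad", "angry", "calm"]
model_b = ["happy", "angry", "angry", "sad"]
model_c = ["sad", "sad", "angry", "fearful"]

combined = majority_vote([model_a, model_b, model_c])
print(combined)
```

With an odd number of models, as in the three-model and five-model ensembles, ties can only occur when more than two distinct labels receive votes, so listing the strongest level-0 model first makes it the fallback in those cases.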