Title: MASCNN: speech emotion recognition using multi-head area self-attention convolutional neural network
Authors: Qianli Ma; Wenjie Zhang; Diqun Yan
Addresses: College of Information Science and Engineering, Ningbo University, Ningbo, 315211, China; College of Information Science and Engineering, Ningbo University, Ningbo, 315211, China; College of Information Science and Engineering, Ningbo University, Ningbo, 315211, China
Abstract: Speech emotion recognition (SER) is pivotal for enhancing human-computer interaction by interpreting emotional expressions. Traditional machine learning approaches often encounter limitations in accuracy and adaptability. This study introduces a novel deep learning-based SER method incorporating attention mechanisms. Data augmentation techniques, including noise injection and speech modification, are applied prior to feature extraction. The Mel-frequency cepstral coefficient (MFCC) features are then processed using a convolutional neural network (CNN) enhanced with a multi-head area self-attention mechanism to improve emotion classification. Evaluation on the Interactive Emotional Dyadic Motion Capture (IEMOCAP) database demonstrates that the proposed method outperforms existing SER techniques in recognition accuracy, with data augmentation significantly enhancing model performance. Additionally, guided class speech elicitation proves more effective than deductive class speech.
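Illustration (not from the paper): the abstract names noise injection as one augmentation step but gives no details. The sketch below shows one common way to do it, adding white Gaussian noise at a chosen signal-to-noise ratio. The function name `add_noise_at_snr`, the 20 dB setting, the 16 kHz sample rate, and the sine-tone input are all assumptions made for the example, not the authors' configuration.

```python
import math
import random


def add_noise_at_snr(signal, snr_db, seed=None):
    """Return a copy of `signal` with white Gaussian noise added at `snr_db` dB.

    Hypothetical helper for illustration; the paper's actual noise model
    and SNR values are not given in the abstract.
    """
    rng = random.Random(seed)
    # Average power of the clean signal.
    signal_power = sum(x * x for x in signal) / len(signal)
    # Noise power that yields the requested signal-to-noise ratio.
    noise_power = signal_power / (10 ** (snr_db / 10))
    noise_std = math.sqrt(noise_power)
    return [x + rng.gauss(0.0, noise_std) for x in signal]


# Toy input: one second of a 220 Hz tone at 16 kHz (stand-in for a speech clip).
sample_rate = 16000
clean = [math.sin(2 * math.pi * 220 * n / sample_rate) for n in range(sample_rate)]
noisy = add_noise_at_snr(clean, snr_db=20, seed=0)
```

In a pipeline like the one described, each augmented copy would then go through MFCC extraction alongside the original recording, so the classifier sees several noisy variants of every utterance.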
Keywords: SER; speech emotion recognition; attention mechanism; CNN; convolutional neural network; IEMOCAP; Interactive Emotional Dyadic Motion Capture Database.
DOI: 10.1504/IJAACS.2025.149806
International Journal of Autonomous and Adaptive Communications Systems, 2025 Vol.18 No.5, pp.403 - 420
Received: 03 Nov 2024
Accepted: 02 Dec 2024
Published online: 13 Nov 2025