Title: A multimodal attentive fusion model for emotion recognition in children's drama
Authors: Zhuo Cai
Addresses: School of Music and Dance (SMD), Changsha Normal University, Changsha, 410100, China
Abstract: This paper addresses the task of emotion recognition in children's drama performances by proposing an attention-based multimodal feature fusion model. The model extracts fine-grained facial expression features from the visual modality using a pre-trained deep network, and derives Mel-spectrograms and acoustic parameters from the audio modality. These feature streams are then dynamically calibrated and integrated via a cross-modal attention fusion module to capture key emotional cues in dramatic contexts. Evaluated on the public RAVDESS dataset of dramatised speech clips, our model achieves a weighted accuracy of 79.4% and an F1-score of 0.782, demonstrating a significant improvement over feature concatenation-based baseline fusion methods. The results indicate that the model effectively perceives subtle emotional dynamics in theatrical settings, offering a reliable tool for children's affective computing.
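Illustration: The abstract describes the fusion design only at a high level. For readers who want a concrete picture, the following is a minimal PyTorch sketch of a generic cross-modal attention fusion head of the kind the abstract describes, assuming a standard multi-head attention formulation. The module names, feature dimensions, pooling strategy, and class count (eight, matching RAVDESS's emotion categories) are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of cross-modal attentive fusion (not the authors' code).
# Visual features (e.g., from a pre-trained face network) attend over audio
# features (e.g., Mel-spectrogram embeddings) and vice versa; the calibrated
# streams are pooled over time and classified into discrete emotions.

import torch
import torch.nn as nn


class CrossModalAttentionFusion(nn.Module):
    def __init__(self, dim=256, num_heads=4, num_classes=8):
        super().__init__()
        # Visual queries attend to audio keys/values, and vice versa.
        self.vis_to_aud = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.aud_to_vis = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_v = nn.LayerNorm(dim)
        self.norm_a = nn.LayerNorm(dim)
        self.classifier = nn.Sequential(
            nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, num_classes)
        )

    def forward(self, vis_feats, aud_feats):
        # vis_feats: (batch, T_v, dim) frame-level facial-expression features
        # aud_feats: (batch, T_a, dim) Mel-spectrogram / acoustic features
        v_att, _ = self.vis_to_aud(vis_feats, aud_feats, aud_feats)
        a_att, _ = self.aud_to_vis(aud_feats, vis_feats, vis_feats)
        # Residual connection + LayerNorm, then temporal mean pooling.
        v = self.norm_v(vis_feats + v_att).mean(dim=1)
        a = self.norm_a(aud_feats + a_att).mean(dim=1)
        return self.classifier(torch.cat([v, a], dim=-1))


if __name__ == "__main__":
    model = CrossModalAttentionFusion()
    vis = torch.randn(2, 16, 256)   # 16 video frames per clip
    aud = torch.randn(2, 40, 256)   # 40 audio frames per clip
    print(model(vis, aud).shape)    # torch.Size([2, 8])
```

In this sketch, the residual connections keep each modality's original stream intact while the attention weights dynamically re-calibrate it against the other modality, which is one common way to realise the "dynamic calibration and integration" the abstract refers to.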
Keywords: multimodality; children's theatre; emotion recognition; attention mechanisms.
DOI: 10.1504/IJICT.2025.151067
International Journal of Information and Communication Technology, 2025 Vol.26 No.50, pp.113 - 132
Received: 22 Oct 2025
Accepted: 17 Nov 2025
Published online: 12 Jan 2026


