Title: Deep bi-directional LSTM network with CNN features for human emotion recognition in audio-video signals

Authors: Lovejit Singh

Addresses: Department of Computer Science and Engineering, University Institute of Engineering, Chandigarh University, Mohali, Punjab, India

Abstract: Human emotion detection in audio-video signals is a challenging task. This paper proposes a human emotion detection method based on a deep bi-directional long short-term memory (Bi-LSTM) network with convolutional neural network (CNN) features. First, it utilises the transfer-learned Inception-ResNet V2 model to extract CNN features from the audio and video modalities. The sequential information in the frame-wise CNN features is then learned by two separate Bi-LSTM models, one for the audio channel and one for the video channel. A weighted product rule-based decision-level fusion method computes the final confidence scores from the output probabilities of the two independent Bi-LSTM models. The proposed approach is validated, tested, and compared with existing deep learning-based audio-video emotion detection methods on the challenging Ryerson audio-visual database of emotional speech and song (RAVDESS). The experimental results show that the proposed approach outperforms the existing methods, attaining 81.03% validation and 83.98% testing emotion detection accuracy on the RAVDESS dataset.
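The following is a minimal sketch, not the authors' code, of the pipeline the abstract describes: frame-wise CNN features from a pre-trained Inception-ResNet V2, separate Bi-LSTM classifiers for the audio and video streams, and weighted product rule fusion of their class probabilities. It assumes TensorFlow/Keras; the sequence length, LSTM sizes, and fusion weights w_a and w_v are illustrative placeholders, not values reported in the paper.

```python
# Hedged sketch of the described audio-video emotion pipeline (assumptions noted).
import numpy as np
import tensorflow as tf
from tensorflow.keras.applications import InceptionResNetV2
from tensorflow.keras import layers, models

NUM_CLASSES = 8    # RAVDESS emotion classes
SEQ_LEN = 30       # frames per clip (assumed)
FEAT_DIM = 1536    # Inception-ResNet V2 global-average-pooled feature size

# 1) Frame-wise CNN feature extractor via transfer learning (weights frozen).
cnn = InceptionResNetV2(weights="imagenet", include_top=False, pooling="avg")
cnn.trainable = False

def build_bilstm(feature_dim=FEAT_DIM):
    """Bi-LSTM classifier over a sequence of per-frame CNN features."""
    return models.Sequential([
        layers.Input(shape=(SEQ_LEN, feature_dim)),
        layers.Bidirectional(layers.LSTM(128, return_sequences=True)),
        layers.Bidirectional(layers.LSTM(64)),
        layers.Dense(NUM_CLASSES, activation="softmax"),
    ])

# Independent models for the audio (e.g. spectrogram frames) and video (face frames) channels.
audio_model = build_bilstm()
video_model = build_bilstm()

# 2) Weighted product rule at the decision level:
#    score_c = p_audio[c]**w_a * p_video[c]**w_v, predict argmax_c score_c.
def fuse(p_audio, p_video, w_a=0.5, w_v=0.5):
    scores = (p_audio ** w_a) * (p_video ** w_v)
    return int(np.argmax(scores))
```

In this sketch each clip is represented by a fixed-length sequence of pooled CNN features; the two softmax outputs are combined multiplicatively so that a class must receive support from both modalities to obtain a high fused score.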

Keywords: convolutional neural network; CNN; bi-directional long short-term memory network; emotion recognition.

DOI: 10.1504/IJSI.2022.121102

International Journal of Swarm Intelligence, 2022 Vol.7 No.1, pp.110 - 122

Received: 23 Jun 2020
Accepted: 27 Nov 2020

Published online: 24 Feb 2022
