Title: Multi-model fusion framework based on multi-input cross-language emotional speech recognition

Authors: Guohua Hu; Qingshan Zhao

Addresses: Department of Computer, Xinzhou Teachers University, Xinzhou, Shanxi, China (both authors)

Abstract: To address the limitations of feature representation and network performance in cross-language emotional speech recognition, this paper proposes a multi-model fusion framework based on multi-input cross-language emotional speech recognition. First, four emotion categories shared across four languages are selected as the experimental samples. Second, affective features extracted in three different modes from the multi-lingual emotional speech signals are combined with an SVM and two deep neural networks (MobileNet26 and ResNet38) to form the basic multi-input, multi-model fusion framework; within each deep network, the feature maps undergo both global maximum pooling and global average pooling so that different features are captured and the diversity of the models is doubled. Finally, comparative experiments show that the multi-model fusion framework distinguishes the emotional differences across multiple languages more effectively than a single network model. Moreover, knowledge learned from widely spoken languages can be transferred to emotional speech recognition in low-resource languages, effectively increasing the learning ability of the model.

Keywords: multi-input; cross-language; emotional speech recognition; multi-model fusion; transfer learning.
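
The abstract's pooling-and-fusion idea can be sketched as follows. This is a toy NumPy illustration, not the authors' implementation: the feature-map shape, the four-class emotion output, and the averaging fusion rule are all assumptions. It shows how applying global maximum pooling and global average pooling to the same feature map yields two distinct feature vectors (doubling the model variants per backbone), and how class probabilities from several models can be fused by late averaging.

```python
import numpy as np

def global_max_pool(fmap):
    """Global maximum pooling: (channels, H, W) -> (channels,)."""
    return fmap.max(axis=(1, 2))

def global_avg_pool(fmap):
    """Global average pooling: (channels, H, W) -> (channels,)."""
    return fmap.mean(axis=(1, 2))

def fuse_predictions(prob_list):
    """Late fusion: average the class-probability vectors of all models."""
    return np.mean(prob_list, axis=0)

# Toy feature map standing in for a deep network's last convolutional output
# (8 channels on a 4x4 spatial grid -- an assumed, illustrative shape).
rng = np.random.default_rng(0)
fmap = rng.standard_normal((8, 4, 4))

# The two pooling modes give two different feature vectors from one backbone.
gmp = global_max_pool(fmap)   # shape (8,)
gap = global_avg_pool(fmap)   # shape (8,)

# Hypothetical class probabilities from two model variants over four emotions.
probs_a = np.array([0.70, 0.10, 0.10, 0.10])
probs_b = np.array([0.40, 0.30, 0.20, 0.10])
fused = fuse_predictions([probs_a, probs_b])  # still a valid distribution
```

Averaging probabilities is only one possible fusion rule; weighted voting or a meta-classifier over the concatenated outputs would fit the same multi-input, multi-model structure.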

DOI: 10.1504/IJWMC.2021.113221

International Journal of Wireless and Mobile Computing, 2021 Vol.20 No.1, pp.32 - 40

Received: 16 Jun 2020
Accepted: 14 Sep 2020

Published online: 15 Feb 2021
