Title: English oral pronunciation recognition based on improved deep neural networks

Authors: Lifang Cheng

Addresses: School of Primary Education, Shangrao Preschool Education College, Shangrao, Jiangxi, 334000, China

Abstract: To address the poor performance, low F1-score, and high error rate of traditional English oral pronunciation recognition methods, an English oral pronunciation recognition method based on improved deep neural networks is proposed. First, the spoken English pronunciation signals and videos are preprocessed to extract audio and lip features. Then, the extracted multimodal features are fused. Finally, the multimodal feature fusion result is used as the input vector, and the English spoken pronunciation recognition result is used as the output vector. A deep neural network model, with an attention module added before the first fully connected layer, is built to produce the recognition results. The experimental results show that the proposed method has good recognition performance, with an F1-score consistently above 0.95 and an error rate of no more than 1%. The method could be applied further in related fields.
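
The pipeline summarized in the abstract (fuse audio and lip features, apply attention before the first fully connected layer, output a recognition result) can be illustrated with a minimal sketch. This is not the author's implementation: the abstract does not specify the fusion strategy, the attention formulation, layer sizes, or training procedure. Concatenation as the fusion step, the element-wise attention scoring, the random parameters, and all function names (`fuse`, `attend`, `fully_connected`, `recognize`, `init_params`) are assumptions chosen only for illustration.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse(audio_features, lip_features):
    """Feature-level fusion by concatenation (an assumed strategy)."""
    return list(audio_features) + list(lip_features)


def attend(fused, attention_weights):
    """Rescale each fused feature by a softmax attention weight.

    The score for each feature is weight * feature; this scoring rule
    is an illustrative assumption, not taken from the paper.
    """
    scores = [w * x for w, x in zip(attention_weights, fused)]
    alphas = softmax(scores)
    return [a * x for a, x in zip(alphas, fused)]


def fully_connected(inputs, weights, biases):
    """Dense layer: one weight row and one bias per output unit."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def recognize(audio_features, lip_features, params):
    """Fusion -> attention -> fully connected layer -> class probabilities."""
    fused = fuse(audio_features, lip_features)
    attended = attend(fused, params["attention"])
    logits = fully_connected(attended, params["fc_weights"], params["fc_biases"])
    return softmax(logits)


def init_params(n_features, n_classes, seed=0):
    """Untrained random parameters (hypothetical; a real model would learn these)."""
    rng = random.Random(seed)
    return {
        "attention": [rng.uniform(-1, 1) for _ in range(n_features)],
        "fc_weights": [[rng.uniform(-1, 1) for _ in range(n_features)]
                       for _ in range(n_classes)],
        "fc_biases": [0.0] * n_classes,
    }


# Toy inputs: 3 audio features and 2 lip features, 4 pronunciation classes.
audio = [0.2, -0.1, 0.7]
lips = [0.5, 0.3]
params = init_params(len(audio) + len(lips), n_classes=4)
probs = recognize(audio, lips, params)
print(probs)
```

The attention step sits between fusion and the dense layer, mirroring the placement described in the abstract; in a trained model the attention and layer weights would be learned rather than drawn at random.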

Keywords: spoken English; pronunciation recognition; deep neural network; attention module; multimodal feature fusion.

DOI: 10.1504/IJBM.2026.151088

International Journal of Biometrics, 2026 Vol.18 No.1/2/3, pp.87 - 107

Received: 30 Dec 2024
Accepted: 23 Mar 2025

Published online: 13 Jan 2026
