Open Access Article

Title: Japanese pronunciation detection and corpus construction based on cross-modal attention

Authors: Xiaolu Liu

Addresses: Global Language Center, Xi'an Eurasia University, Xi'an, 710065, China

Abstract: To address Japanese pronunciation error detection, this paper proposes a fusion method based on cross-modal attention and constructs a Japanese pronunciation corpus. The model integrates audio Mel-spectrogram features and visual lip-motion features through an attention mechanism, capturing fine-grained cross-modal interactions and enabling precise phoneme-level error recognition. Evaluated on both the public corpus from Saruwatari Lab, University of Tokyo and a self-built corpus, the proposed approach achieves an accuracy of 92.3%, 3.1% higher than the best baseline model. It also maintains a robust accuracy of 85.3% at a low signal-to-noise ratio of 5 dB, a 6.6% improvement over other methods. This study provides an effective, noise-robust tool for multimodal speech learning with strong potential for educational applications. The released corpus contains 50 hours of multimodal data with detailed annotations, offering comprehensive support for Japanese language teaching and advanced speech technology development.
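The paper's implementation is not reproduced here; as a rough, hypothetical sketch, the PyTorch code below shows one plausible form of the fusion the abstract describes: encoded Mel-spectrogram frames and encoded lip-motion frames attend over each other, the two context streams are aligned and concatenated, and a frame-level head scores phonemes. All module names, dimensions, and the interpolation-based alignment step are assumptions for illustration, not the authors' architecture.

```python
import torch
import torch.nn as nn


class CrossModalFusion(nn.Module):
    """Illustrative cross-modal attention fusion of audio and lip features."""

    def __init__(self, d_model=256, n_heads=4, n_phonemes=40):
        super().__init__()
        # Each modality queries the other (hypothetical choice of two
        # symmetric cross-attention blocks).
        self.audio_to_visual = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.visual_to_audio = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm_a = nn.LayerNorm(d_model)
        self.norm_v = nn.LayerNorm(d_model)
        # Frame-level phoneme head (stand-in for the paper's error detector).
        self.classifier = nn.Linear(2 * d_model, n_phonemes)

    def forward(self, audio_feats, visual_feats):
        # audio_feats:  (batch, T_audio, d_model), e.g. encoded Mel frames
        # visual_feats: (batch, T_video, d_model), e.g. encoded lip-ROI frames
        a2v, _ = self.audio_to_visual(audio_feats, visual_feats, visual_feats)
        v2a, _ = self.visual_to_audio(visual_feats, audio_feats, audio_feats)
        audio_ctx = self.norm_a(audio_feats + a2v)    # residual + norm
        visual_ctx = self.norm_v(visual_feats + v2a)
        # Align the visual stream to the audio frame rate by linear
        # interpolation (a stand-in for whatever alignment the paper uses).
        visual_ctx = nn.functional.interpolate(
            visual_ctx.transpose(1, 2), size=audio_ctx.size(1),
            mode="linear", align_corners=False,
        ).transpose(1, 2)
        fused = torch.cat([audio_ctx, visual_ctx], dim=-1)
        return self.classifier(fused)  # (batch, T_audio, n_phonemes)


# Toy usage: 100 audio frames, 25 video frames, 256-dim features.
model = CrossModalFusion()
logits = model(torch.randn(2, 100, 256), torch.randn(2, 25, 256))
print(logits.shape)  # torch.Size([2, 100, 40])
```

The key design point the sketch tries to capture is that each modality queries the other, so the fused representation carries fine-grained audio-visual correspondences; the interpolation step could equally be replaced by a learned alignment.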

Keywords: cross-modal learning; pronunciation error detection; Japanese speech processing; attention mechanisms; corpus construction.

DOI: 10.1504/IJICT.2025.150403

International Journal of Information and Communication Technology, 2025 Vol.26 No.43, pp.61–77

Received: 25 Sep 2025
Accepted: 25 Oct 2025

Published online: 12 Dec 2025