Open Access Article

Title: MFF-PEA: an automatic assessment model for professional spoken English via multimodal feature fusion

Authors: Fang Gao

Addresses: Department of Basic Courses, Hebei Vocational College of Resources and Environment, Shijiazhuang 050000, China

Abstract: Accurate assessment of professional spoken English requires capturing both nuanced linguistic accuracy and non-verbal paralinguistic cues in cross-cultural communication settings. To address the limitations of unimodal approaches and static fusion methods, we propose Multimodal Feature Fusion-based Professional English Assessment (MFF-PEA), an adaptive framework integrating speech, facial expressions, and gestural dynamics. The core innovation is a cross-modal dynamic fusion (CMDF) mechanism that employs learnable attention gates to weight modalities according to their contextual relevance. For joint optimisation, a hybrid loss function combines a regression loss for absolute scoring with a pairwise ranking loss for proficiency discrimination. Rigorous evaluations on multi-domain professional datasets confirm MFF-PEA's significant superiority over state-of-the-art baselines, exhibiting stronger predictive consistency and lower assessment errors. Comprehensive ablation studies validate the necessity of each architectural component, while cross-domain tests in business, medical, and legal scenarios demonstrate transferable robustness. This work establishes a context-sensitive paradigm for automated multimodal language assessment.
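The two mechanisms named in the abstract, attention-gated modality weighting and a hybrid regression-plus-ranking objective, can be illustrated with a minimal sketch. This is an assumed, simplified rendering (softmax gates over dot-product scores, MSE plus a hinge-style pairwise ranking term), not the paper's exact CMDF layer or loss; the function names and parameters (`cmdf_fuse`, `hybrid_loss`, `margin`, `lam`) are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scalar scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cmdf_fuse(features, gates):
    """Attention-gated fusion sketch: each modality's feature vector
    (assumed already projected to a shared dimension) is scored by a
    learnable gate vector; softmax weights give the fused representation."""
    scores = [sum(gi * fi for gi, fi in zip(g, f))
              for g, f in zip(gates, features)]
    alphas = softmax(scores)
    dim = len(features[0])
    fused = [sum(a * f[k] for a, f in zip(alphas, features))
             for k in range(dim)]
    return fused, alphas

def hybrid_loss(preds, targets, margin=0.5, lam=1.0):
    """Hybrid objective sketch: MSE for absolute scoring plus a
    pairwise hinge ranking term for proficiency discrimination
    (a common formulation; the paper's exact loss may differ)."""
    n = len(preds)
    mse = sum((p - t) ** 2 for p, t in zip(preds, targets)) / n
    rank, pairs = 0.0, 0
    for i in range(n):
        for j in range(n):
            if targets[i] > targets[j]:  # i should outrank j
                rank += max(0.0, margin - (preds[i] - preds[j]))
                pairs += 1
    return mse + lam * (rank / pairs if pairs else 0.0)
```

With equal gate scores the modalities are weighted uniformly; correctly ordered predictions separated by at least the margin incur only the regression penalty.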

Keywords: professional oral English assessment; multimodal fusion; dynamic attention; ranking loss; cross-domain evaluation.

DOI: 10.1504/IJICT.2026.151530

International Journal of Information and Communication Technology, 2026 Vol.27 No.5, pp.1-15

Received: 18 Jul 2025
Accepted: 08 Sep 2025

Published online: 04 Feb 2026