Open Access Article

Title: Cross-modal Chinese text representation enhancement for multimodal sentiment analysis

Authors: Xin Zhang

Addresses: Lanzhou Resources and Environment Voc-Tech University, Lanzhou, 730060, China

Abstract: Addressing the dual challenges of textual vulnerability to noise and inefficient cross-modal interaction in Chinese multimodal sentiment analysis, this paper introduces a novel framework built around a cross-modal text enhancement module (CTEM). The CTEM adaptively recalibrates the semantic representation of Chinese text through contextual refinement. Concurrently, a cross-modal attention mechanism guides visual and acoustic feature extraction, enabling synergistic fusion across modalities. Evaluated on the Chinese single- and multimodal sentiment (CH-SIMS) benchmark, which features unaligned video segments and dual sentiment labels, our model achieves 83.2% accuracy, surpassing mainstream baselines by up to 3.2 percentage points with a 0.029 F1-score gain. Ablation studies confirm the critical contributions of both the CTEM representation refinement and the cross-modal interaction design. This work establishes a robust paradigm for decoding nuanced sentiment in linguistically complex Chinese multimedia content.
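The cross-modal attention the abstract describes, in which text representations attend over another modality's features, can be sketched minimally as scaled dot-product attention. This is a generic illustration, not the article's implementation: the function name, dimensions, and random projection matrices are hypothetical stand-ins for the model's learned weights.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_attention(text, other, d_k=8, seed=0):
    """Text tokens (queries) attend over another modality's
    features (keys/values), e.g. acoustic or visual frames.
    text: (T_text, d), other: (T_other, d); lengths may differ,
    so no temporal alignment between modalities is required."""
    rng = np.random.default_rng(seed)
    d = text.shape[1]
    # Random projections standing in for learned Q/K/V weights.
    W_q, W_k, W_v = (rng.standard_normal((d, d_k)) for _ in range(3))
    Q, K, V = text @ W_q, other @ W_k, other @ W_v
    weights = softmax(Q @ K.T / np.sqrt(d_k))  # (T_text, T_other)
    # Each text position receives a context vector from the other modality.
    return weights @ V  # (T_text, d_k)
```

Because the attention map has shape (text length x other-modality length), the two sequences need not be the same length, which is consistent with the unaligned video segments in CH-SIMS.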

Keywords: cross-modal text information enhancement; multimodal sentiment analysis; Chinese semantic understanding; feature fusion; attention mechanism.

DOI: 10.1504/IJICT.2025.149050

International Journal of Information and Communication Technology, 2025 Vol.26 No.35, pp.89-103

Received: 07 Jul 2025
Accepted: 17 Aug 2025

Published online: 10 Oct 2025