Open Access Article

Title: MMHFN: a multimodal deep learning framework for intelligent classification and management of government documents

Authors: Qizhong Luo; Tiantian Huang; Xianli Zeng

Addresses: Guilin University of Electronic Technology, Guilin 541004, China; College of Environmental Science and Engineering, Guilin University of Technology, Guilin 541004, China; School of Computer Science and Information Security, Guilin University of Electronic Technology, Guilin 541004, China

Abstract: Government document intelligence faces significant challenges from dense text, pervasive visual noise (e.g., stamps, low-resolution scans), and high OCR error rates (>25%), which hinder automated classification in e-governance. To address this, we propose MMHFN, a multimodal deep learning framework integrating: 1) a spatial-attention-enhanced MobileNetV3 for noise-robust visual feature extraction; 2) a dual-path text encoder (FastText subword embeddings + domain-adapted BERT) with gated fusion to mitigate OCR errors; 3) lightweight differentiable optimal transport for cross-modal alignment; 4) a Seq2Seq OCR-correction module. Experimental results on the RVL-CDIP, Tobacco3482, and GOV-DOCBench datasets show that MMHFN achieves 92.7% accuracy (+6.2% over unimodal baselines) and a 90.3% F1-score, with only a 3.1% accuracy drop under severe OCR noise, while supporting real-time edge deployment (190 ms/page). This work contributes an efficient, domain-adapted solution for intelligent document management and publicly releases the 12,000-document GOV-DOCBench benchmark.
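The gated fusion of the dual-path text encoder can be illustrated with a minimal sketch. The abstract does not specify the gate's exact form, so the following assumes a common design: a learned elementwise sigmoid gate computed from the concatenated FastText and BERT embeddings, producing a convex combination of the two. All dimensions, weights, and names here are illustrative, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def gated_fusion(h_fast, h_bert, W, b):
    """Elementwise gated fusion of two same-dimension text embeddings.

    g in (0, 1) weights the OCR-robust FastText path against the
    context-rich BERT path; the output is a per-dimension convex
    combination of the two embeddings.
    """
    g = sigmoid(np.concatenate([h_fast, h_bert]) @ W + b)
    return g * h_fast + (1.0 - g) * h_bert


d = 8
h_fast = rng.normal(size=d)            # toy FastText subword embedding
h_bert = rng.normal(size=d)            # toy BERT embedding (projected to same dim)
W = rng.normal(size=(2 * d, d)) * 0.1  # gate weights (learned in practice)
b = np.zeros(d)                        # gate bias

fused = gated_fusion(h_fast, h_bert, W, b)
```

Because the gate lies strictly in (0, 1), each fused component stays between the corresponding FastText and BERT components, so a noisy OCR-driven embedding can be down-weighted per dimension rather than discarded outright.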

Keywords: multimodal deep learning; government document classification; feature fusion; OCR error-correction module; intelligent management.

DOI: 10.1504/IJRIS.2025.148755

International Journal of Reasoning-based Intelligent Systems, 2025, Vol.17, No.11, pp.1-11

Received: 14 Jun 2025
Accepted: 20 Aug 2025

Published online: 22 Sep 2025