Title: Res-THL: multimodal pre-training models for document understanding

Authors: Lei Zhang; Yong Wang; Nan Yang; Bin Jiang

Addresses: School of Artificial Intelligence, Chongqing University of Technology, Chongqing, China

Abstract: Multimodal pre-training models based on the Transformer architecture have been widely adopted in visually-rich document understanding, achieving impressive results. However, existing models still face limitations in effectively extracting features from the diverse modalities present in visually-rich document data. In this paper, we propose a pre-training model called Res-THL that facilitates interactive modelling between the image, layout and text modalities. To address the inadequacy of existing methods in extracting image features, we introduce a new image feature extraction module called Res, which incorporates a residual structure to capture richer image representations. Additionally, existing studies often overlook the learning of features embedded within the hidden layers of multilayer attention networks. To overcome this limitation, we design the Transformer Hidden Layer Learning (THL) module, which integrates a spatial attention mechanism to adaptively learn the layout and text features embedded within the Transformer hidden layers. Res-THL is evaluated on the FUNSD data set, and the results demonstrate that the proposed Res-THL network achieves enhanced performance, with an F1 score of 0.8325.
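The abstract describes two components but the listing provides no code. Below is a minimal PyTorch sketch of what modules of this kind might look like; the class names `ResBlock` and `THL`, the channel/hidden sizes, and all layer choices are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn


class ResBlock(nn.Module):
    """Hypothetical residual image-feature block. The paper's 'Res' module
    is described only as incorporating a residual structure; the specific
    convolutions and normalisation used here are assumptions."""

    def __init__(self, channels: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
        )
        self.act = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The skip connection lets the block learn a residual correction,
        # preserving low-level visual cues alongside deeper features.
        return self.act(x + self.body(x))


class THL(nn.Module):
    """Hypothetical Transformer Hidden Layer Learning module: applies a
    learned attention map to an encoder layer's hidden states so the model
    can adaptively reweight layout/text features. The paper's exact gating
    mechanism may differ."""

    def __init__(self, hidden_size: int):
        super().__init__()
        self.score = nn.Linear(hidden_size, 1)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq_len, hidden_size) from one encoder layer.
        attn = torch.sigmoid(self.score(hidden_states))  # (batch, seq_len, 1)
        return hidden_states * attn  # adaptively reweighted features


if __name__ == "__main__":
    # Toy shapes only, to show the modules run end to end.
    img = torch.randn(2, 64, 32, 32)
    hid = torch.randn(2, 128, 768)
    print(ResBlock(64)(img).shape)  # torch.Size([2, 64, 32, 32])
    print(THL(768)(hid).shape)      # torch.Size([2, 128, 768])
```

The sketch only illustrates the two ideas named in the abstract (a residual image path and per-position reweighting of hidden states); how Res-THL fuses these with the layout and text embeddings is not specified in the abstract.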

Keywords: deep learning; multimodal natural language processing; pre-trained models; document understanding.

DOI: 10.1504/IJWMC.2025.148584

International Journal of Wireless and Mobile Computing, 2025 Vol.29 No.3, pp.248 - 255

Received: 13 Aug 2023
Accepted: 17 Feb 2024

Published online: 14 Sep 2025
