Open Access Article

Title: A CNN-ViT fusion model for predicting tourist behaviour and consumption intentions

Authors: Chiawei Liu

Addresses: School of Geographical Sciences and Tourism, Jiaying University, Meizhou, 514015, China

Abstract: Current methods struggle to infer tourist behaviour and consumption intentions because they have difficulty deeply mining textual semantic information. To address this, this paper first optimises a convolutional neural network (CNN)-based vision transformer algorithm and then proposes an improved method, based on the vision transformer, for inferring tourist behaviour and consumption intentions. In the tourist behaviour detection branch, a text-aware module is introduced to improve the extraction of tourist image features and enhance the expressive power of textual visual features. In the consumption intention inference branch, parallel transformer decoding is performed at both the visual and linguistic levels, and semantic information is mined and integrated through positional encoding to achieve accurate inference of consumption intentions. The experimental results show that the accuracy of tourist behaviour detection is 96.8% and the accuracy of consumption intention inference is 94.2%. Compared with the baseline model, the proposed model is more efficient and performs better.

Keywords: convolutional neural network; CNN; vision transformer; feature extraction; tourist behaviour detection; consumption intent inference.

DOI: 10.1504/IJICT.2026.151489

International Journal of Information and Communication Technology, 2026 Vol.27 No.2, pp.83-99

Received: 28 Oct 2025
Accepted: 02 Dec 2025

Published online: 02 Feb 2026