Title: Deep learning for visual aesthetics: using convolutional vision transformers and HRNet for classifying anime and human selfies
Authors: Congli Zhang
Addresses: Zhengzhou Academy of Fine Arts, Speciality of Fine Arts, Zhengzhou, Henan, 451452, China
Abstract: Visual aesthetics plays a vital role in digital media today, and putting it to work can improve user engagement and is crucial for personalised content recommendation. Classifying and differentiating human selfies from animated images with AI is difficult because of the subtle stylistic differences and complex feature representations in both categories. In this study, we propose an advanced framework that uses vision transformers (ViT) and high-resolution networks (HRNet) for this classification task. Trained on an online dataset, the proposed models learn not only high-level representations but also contextual dependencies, classifying test data with 99% accuracy for ViT and 97% for HRNet, exceeding traditional convolutional neural network (CNN) based models by more than 10%. These results support automatic content moderation and provide a solid basis for applying advanced vision models to multimedia and digital content processing.
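The abstract attributes the models' strength partly to capturing contextual dependencies between image regions, which in a ViT is done by self-attention over patch embeddings. As a minimal, hedged illustration (not the paper's implementation), the sketch below computes scaled dot-product self-attention over a few toy patch vectors in NumPy; the shapes and random data are assumptions for demonstration only:

```python
import numpy as np

def scaled_dot_product_attention(q, k, v):
    """Attention(Q, K, V) = softmax(QK^T / sqrt(d)) V over patch embeddings."""
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)                   # pairwise patch similarities
    scores -= scores.max(axis=-1, keepdims=True)    # shift for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax: each row sums to 1
    return weights @ v, weights                     # attended values, attention map

rng = np.random.default_rng(0)
patches = rng.standard_normal((4, 8))  # 4 toy patch embeddings of dimension 8
out, attn = scaled_dot_product_attention(patches, patches, patches)
print(out.shape)   # attended output, one vector per patch: (4, 8)
print(attn.shape)  # attention weights between patches: (4, 4)
```

Each output row is a weighted mixture of all patch embeddings, which is the mechanism by which a ViT relates distant image regions (e.g., stylistic cues spread across a face) rather than only local neighbourhoods as in a CNN.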
Keywords: vision transformers; ViT; artificial intelligence; deep learning; visual aesthetics; convolutional neural networks; CNNs; feature extraction; classification.
DOI: 10.1504/IJICT.2025.146811
International Journal of Information and Communication Technology, 2025 Vol.26 No.20, pp.75 - 98
Received: 24 Mar 2025
Accepted: 14 Apr 2025
Published online: 18 Jun 2025