Title: Demonstration and analysing the performance of image caption generator: efforts for visually impaired candidates for Smart Cities 5.0

Authors: Rohit Rastogi; Vineet Rawat; Sidhant Kaushal

Addresses: Department of CSE, ABES Engineering College, Ghaziabad, U.P., India (all authors)

Abstract: Image caption generation has become a prominent area of research due to its potential applications in multimedia understanding and accessibility. This paper presents a comprehensive study of three state-of-the-art approaches for image caption generation, employing convolutional neural networks (CNN) with long short-term memory (LSTM) networks, attention mechanisms, and transformers. The first approach utilises a CNN-LSTM architecture, where the CNN acts as an encoder to extract meaningful visual features from input images. These features are then fed into an LSTM-based decoder, enabling the generation of descriptive captions. The second approach introduces the use of attention mechanisms, allowing the model to focus on specific regions of the image while generating captions. This technique improves the caption quality and ensures that the generated text corresponds more accurately to the content in the image. Lastly, the third approach incorporates the powerful transformer architecture to capture long-range dependencies in the generated captions, enabling better contextual understanding and coherence.
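To make the second approach concrete, the sketch below illustrates the core of a soft attention step over CNN region features, as described in the abstract: the decoder's hidden state is scored against each image region, the scores are normalised with a softmax, and a weighted context vector is formed. This is a minimal NumPy illustration, not the authors' implementation; the feature-map shape (7×7 regions of 512 dimensions), the bilinear scoring matrix `W`, and all variable names are assumptions for demonstration.

```python
import numpy as np

def softmax(x):
    # numerically stable softmax over a 1-D score vector
    e = np.exp(x - np.max(x))
    return e / e.sum()

def attend(region_feats, hidden, W):
    """One soft-attention step (illustrative sketch).

    region_feats: (N, D) CNN features, one row per image region
    hidden:       (D,)   current decoder hidden state
    W:            (D, D) assumed bilinear scoring matrix
    """
    scores = region_feats @ W @ hidden   # (N,) relevance of each region
    weights = softmax(scores)            # attention distribution over regions
    context = weights @ region_feats     # (D,) weighted sum fed to the decoder
    return context, weights

# Hypothetical shapes: a 7x7 feature map flattened to 49 regions of 512 dims
rng = np.random.default_rng(0)
regions = rng.normal(size=(49, 512))
hidden = rng.normal(size=512)
W = rng.normal(size=(512, 512)) * 0.01

context, weights = attend(regions, hidden, W)
```

At each decoding step the context vector is concatenated with the word embedding before the LSTM update, which is what lets the caption text track specific image regions rather than a single global feature.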

Keywords: CNN; LSTM; transformer; image caption; attention mechanism; benchmark; NLP.

DOI: 10.1504/IJAMECHS.2024.143152

International Journal of Advanced Mechatronic Systems, 2024 Vol.11 No.3, pp.161 - 178

Received: 10 Aug 2023
Accepted: 08 Mar 2024

Published online: 04 Dec 2024
