Title: ITTS model: speech generation for image captioning using feature extraction for end-to-end synthesis

Authors: Tushar H. Ghorpade; Subhash K. Shinde

Addresses: Department of Computer Engineering, Ramrao Adik Institute of Technology, Nerul, Navi Mumbai, India; Department of Computer Engineering, Lokmanya Tilak College of Engineering, Kopar Khairane, Navi Mumbai, India

Abstract: The current growth in e-content is attributed to information exchanged through social media, e-news, etc. Several researchers have proposed encoder-decoder models with impressive accuracy. This paper exploits feature extraction from images and text for the encoder model, using a word embedding method with proposed convolutional layers. State-of-the-art image-to-text and text-to-speech (ITTS) systems learn models separately: one describes the content of an image, and the other follows with speech generation. We adopted the Tacotron model for the naturalness of the synthesised speech and trained on the most popular datasets. The system is analysed consistently using evaluation metrics such as bilingual evaluation understudy (BLEU), METEOR, and mean opinion score (MOS). The proposed method significantly enhances the performance of a standard image captioning and speech generation model, yielding competitive results. The results show an improvement of almost 4% to 5% in BLEU score for the image captioning model, and a MOS of approximately 3.73 for the speech model.

Keywords: image captioning; convolutional neural network; recurrent neural network; RNN; sequence to sequence language model; text-to-speech synthesis model; Tacotron model; LSTM.

DOI: 10.1504/IJISTA.2023.131569

International Journal of Intelligent Systems Technologies and Applications, 2023 Vol.21 No.2, pp.176 - 198

Received: 08 Jan 2023
Accepted: 27 Mar 2023

Published online: 19 Jun 2023
