
Title: Personalised video summarisation using video-text multi-modal fusion

Authors: Rakhi Akhare; Subhash K. Shinde

Addresses: Lokmanya Tilak College of Engineering, Vikas Nagar, Sector 4, Koparkhairne, Navi Mumbai-400079, India; Lokmanya Tilak College of Engineering, Vikas Nagar, Sector 4, Koparkhairne, Navi Mumbai-400079, India

Abstract: Video summarisation techniques have evolved in recent years, but most focus on visual content and ignore user preferences. This work addresses query-focused video summarisation: given a long video as input, the goal is to produce a summary driven by the user's query, expressed as full sentences rather than keywords. The proposed personalised video summarisation (PVS) system has two parts: a feature encoding network and a query-relevance computation module. To produce a customised summary, the end-to-end approach combines the encoded visual and textual information and assigns a query-relevance score. The PVS model is evaluated on the video-query dataset using fastText and ResNet embeddings. Compared with various combinations of language and vision models, the proposed PVS model performs better and achieves an accuracy of 0.53%. This study supports the research community working on multimodal video summarisation.
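
Illustration: the fusion-and-scoring idea described in the abstract can be sketched roughly as follows. This is a minimal toy sketch and not the authors' implementation. It assumes fusion by simple concatenation and a single sigmoid-activated linear layer as the relevance scorer; the function names (`fuse`, `relevance`, `summarise`), the weights and the two-dimensional toy vectors are all hypothetical stand-ins for real ResNet frame features and fastText query embeddings.

```python
import math

def fuse(frame_feat, query_emb):
    """Fuse one frame feature with the query embedding by concatenation.
    Concatenation is an assumption here; the paper's fusion may differ."""
    return list(frame_feat) + list(query_emb)

def relevance(fused, weights, bias=0.0):
    """Map a fused vector to a score in (0, 1) with a linear layer and sigmoid."""
    z = sum(w * x for w, x in zip(weights, fused)) + bias
    return 1.0 / (1.0 + math.exp(-z))

def summarise(frame_feats, query_emb, weights, k):
    """Score every frame against the query and keep the k best, in temporal order."""
    scores = [relevance(fuse(f, query_emb), weights) for f in frame_feats]
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return sorted(top), scores

# Hypothetical toy inputs: four frames with 2-d "visual" features, one 2-d "query".
frames = [[0.9, 0.1], [0.2, 0.8], [0.7, 0.3], [0.1, 0.1]]
query = [1.0, 0.0]
weights = [2.0, -1.0, 0.5, 0.5]  # hand-picked, stands in for learned parameters

selected, scores = summarise(frames, query, weights, k=2)
print(selected)
```

In a trained system the weights would be learned end to end from query-annotated videos, and the selected frames or shots would be stitched together to form the personalised summary.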

Keywords: personalised video summarisation; PVS; word embedding; feature fusion; multi-modal video summarisation; query based video summarisation.

DOI: 10.1504/IJCVR.2025.146294

International Journal of Computational Vision and Robotics, 2025 Vol.15 No.3, pp.379 - 394

Received: 10 Mar 2023
Accepted: 15 Nov 2023
Published online: 19 May 2025


