Authors: Jaeyoung Lee; Junmo Kim
Addresses: School of Electrical Engineering, Korea Advanced Institute of Science and Technology, 291 Daehak-ro, Yuseong-gu, Daejeon 34141, South Korea
Abstract: In addition to visual features, video contains temporal information that contributes semantic meaning about the relationships between objects and scenes. There have been many attempts to describe spatial and temporal relationships in video, but simple encoder-decoder models are not sufficient to capture detailed relationships in video clips. A video clip often consists of several shots that appear unrelated, and simple recurrent models suffer from these shot changes. In other fields, including visual question answering and action recognition, researchers have begun to take interest in describing visual relations between objects. In this paper, we introduce a video captioning method that captures temporal relationships with a non-local block and a boundary-aware system. We evaluate our approach on the Microsoft Video Description Corpus (MSVD, YouTube2Text) and the Microsoft Research Video to Text (MSR-VTT) dataset. The experimental results show that a non-local block applied along the temporal axis can improve performance on video captioning datasets.
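To make the core idea concrete, the following is a minimal sketch of a non-local (self-attention) block applied along the temporal axis of per-frame features, in the spirit of Wang et al.'s non-local neural networks. The projection matrices here are random stand-ins for learned weights, and the function name and shapes are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def softmax(z, axis=-1):
    # Numerically stable softmax.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def temporal_non_local_block(x, rng=None):
    """Self-attention over the temporal axis of frame features.

    x: (T, C) array of per-frame features.
    Returns a (T, C) array: each frame's output is a weighted sum over
    ALL frames (non-local), added back to the input as a residual.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    T, C = x.shape
    # Stand-ins for learned 1x1-conv projections (theta, phi, g).
    W_theta = rng.standard_normal((C, C)) / np.sqrt(C)
    W_phi = rng.standard_normal((C, C)) / np.sqrt(C)
    W_g = rng.standard_normal((C, C)) / np.sqrt(C)
    theta, phi, g = x @ W_theta, x @ W_phi, x @ W_g
    # (T, T) frame-to-frame affinity, normalised per query frame.
    attn = softmax(theta @ phi.T / np.sqrt(C))
    return x + attn @ g  # residual connection
```

Because the attention matrix relates every frame to every other frame, distant shots can still exchange information, which a purely recurrent encoder struggles with across shot boundaries.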
Keywords: video captioning; non-local mean; self-attention; video description.
International Journal of Computational Vision and Robotics, 2019 Vol.9 No.5, pp.502 - 514
Received: 17 Jun 2018
Accepted: 18 Sep 2018
Published online: 11 Sep 2019