Authors: Jaeyoung Lee; Junmo Kim
Addresses: School of Electrical Engineering, Korea Advanced Institute of Science and Technology, 291 Daehak-ro, Yuseong-gu, Daejeon 34141, South Korea
Abstract: In addition to visual features, video contains temporal information that contributes semantic meaning about the relationships between objects and scenes. There have been many attempts to describe spatial and temporal relationships in video, but simple encoder-decoder models are not sufficient to capture detailed relationships in video clips. A video clip often consists of several shots that appear unrelated, and simple recurrent models suffer from these shot changes. In other fields, including visual question answering and action recognition, researchers have begun to take interest in describing visual relations between objects. In this paper, we introduce a video captioning method that captures temporal relationships with a non-local block and a boundary-aware system. We evaluate our approach on the Microsoft Video Description Corpus (MSVD, YouTube2Text) and the Microsoft Research Video to Text (MSR-VTT) dataset. The experimental results show that a non-local block applied along the temporal axis can improve performance on video captioning datasets.
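To make the core idea concrete, the following is a minimal sketch of a non-local (self-attention) block applied along the temporal axis of per-frame features, in the spirit of Wang et al.'s non-local neural networks. The projection matrices here are random stand-ins for learned weights, and the function name and shapes are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def softmax(z, axis=-1):
    # Numerically stable softmax.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def temporal_non_local_block(x, rng=None):
    """Self-attention over the temporal axis of frame features.

    x: (T, C) array of per-frame features.
    Returns a (T, C) array: each frame's output is a weighted sum over
    ALL frames (non-local), added back to the input as a residual.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    T, C = x.shape
    # Stand-ins for learned 1x1-conv projections (theta, phi, g).
    W_theta = rng.standard_normal((C, C)) / np.sqrt(C)
    W_phi = rng.standard_normal((C, C)) / np.sqrt(C)
    W_g = rng.standard_normal((C, C)) / np.sqrt(C)
    theta, phi, g = x @ W_theta, x @ W_phi, x @ W_g
    # (T, T) frame-to-frame affinity, normalised per query frame.
    attn = softmax(theta @ phi.T / np.sqrt(C))
    return x + attn @ g  # residual connection
```

Because the attention matrix relates every frame to every other frame, distant shots can still exchange information, which a purely recurrent encoder struggles with across shot boundaries.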
Keywords: video captioning; non-local mean; self-attention; video description.
International Journal of Computational Vision and Robotics, 2019 Vol.9 No.5, pp.502 - 514
Received: 17 Jun 2018
Accepted: 18 Sep 2018
Published online: 11 Sep 2019