A video retrieval algorithm with embedded position coding for moment localization and highlight detection

Fei Song; Yude Wang; Ting Zhao; Teng Liu

doi:10.1117/12.3055603

2 December 2024

Proceedings Volume 13443, Fifth International Conference on Computer Vision and Information Technology (CVIT 2024); 1344306 (2024) https://doi.org/10.1117/12.3055603
Event: 2024 5th International Conference on Computer Vision and Information Technology (CVIT 2024), 2024, Beijing, China

Abstract

Existing video retrieval algorithms capture contextual information within each modality only incompletely. To address this, this paper introduces 2D sine-cosine positional encoding into unimodal encoding to capture fine-grained local details and to improve the accuracy of moment localization and highlight detection. First, the video and text are processed to extract visual, audio, and text features. The visual and audio features are each fed into a unimodal encoder with embedded 2D sine-cosine positional encoding to capture global temporal relationships, and the results are then combined by multimodal fusion. The fused features and the text features are used to generate always-aligned queries. Finally, query decoding and prediction produce the moment localization and highlight detection results, completing the video retrieval. The proposed method was evaluated on four datasets: QVHighlights, Charades-STA, TVSum, and YouTube Highlights. On QVHighlights, the mAP scores for moment localization and highlight detection reached 39.09 and 39.68, respectively. On Charades-STA, Recall@1 at an IoU threshold of 0.5 reached 51.13. On TVSum and YouTube Highlights, mAP reached 83.7 and 75.3, respectively. This work provides theoretical support for building practical video retrieval systems.
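The abstract does not spell out the encoding itself, so the following is only a minimal sketch of the standard fixed 2D sine-cosine positional encoding (the variant common in ViT- and MAE-style models), written in Python with NumPy. It assumes the unimodal features are laid out on a `grid_h` by `grid_w` grid. How the authors map video clips or audio frames onto the two axes is not stated here, so `sincos_1d`, `sincos_2d`, the grid size, the feature width, and the frequency base of 10000 are illustrative assumptions, not the paper's implementation.

```python
import numpy as np


def sincos_1d(embed_dim: int, positions: np.ndarray) -> np.ndarray:
    """Sine-cosine encoding of scalar positions: (M,) -> (M, embed_dim)."""
    assert embed_dim % 2 == 0, "embed_dim must be even"
    # Geometric frequency ladder 1 / 10000^(2k / embed_dim), k = 0 .. embed_dim/2 - 1
    omega = np.arange(embed_dim // 2, dtype=np.float64) / (embed_dim / 2.0)
    omega = 1.0 / 10000.0 ** omega
    angles = np.outer(positions.reshape(-1).astype(np.float64), omega)  # (M, D/2)
    # First half of the channels holds sines, second half cosines
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=1)


def sincos_2d(embed_dim: int, grid_h: int, grid_w: int) -> np.ndarray:
    """2D sine-cosine encoding for a grid_h x grid_w grid: (grid_h * grid_w, embed_dim).

    Half of the channels encode the row index, the other half the column index.
    Rows of the result follow row-major order: index r * grid_w + c.
    """
    assert embed_dim % 4 == 0, "embed_dim must be divisible by 4"
    rows, cols = np.meshgrid(np.arange(grid_h), np.arange(grid_w), indexing="ij")
    emb_rows = sincos_1d(embed_dim // 2, rows)
    emb_cols = sincos_1d(embed_dim // 2, cols)
    return np.concatenate([emb_rows, emb_cols], axis=1)


# Usage (hypothetical sizes): add the fixed encoding to a stand-in unimodal
# feature map before it enters the unimodal encoder.
rng = np.random.default_rng(0)
grid_h, grid_w, dim = 8, 16, 256
clip_feats = rng.standard_normal((grid_h * grid_w, dim))  # placeholder visual or audio features
pe = sincos_2d(dim, grid_h, grid_w)
encoded = clip_feats + pe
print(pe.shape, encoded.shape)
```

Because the encoding is a fixed function of position rather than a learned table, it adds no parameters and gives every grid cell a distinct code that can be summed with the features.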

(2024) Published by SPIE. Downloading of the abstract is permitted for personal use only.

Citation

Fei Song, Yude Wang, Ting Zhao, and Teng Liu "A video retrieval algorithm with embedded position coding for moment localization and highlight detection", Proc. SPIE 13443, Fifth International Conference on Computer Vision and Information Technology (CVIT 2024), 1344306 (2 December 2024); https://doi.org/10.1117/12.3055603
