This paper proposes a two-stage video scene segmentation method based on multimodal semantic interaction. The method divides scene segmentation into two stages: shot-level audio-visual representation learning and multimodal scene segmentation. In the first stage, the method exploits the strong correlation and complementarity between the audio and visual streams, using an interactive attention module to mine audio-visual semantic information in depth; it also introduces a self-supervised learning strategy that leverages the temporal structure of scenes to improve the model's generalization ability. In the second stage, the method builds a multimodal feature fusion module that learns a unified shot representation from the audio and visual representations via an attention mechanism, and adds a visual discrimination loss that regulates the influence of the audio and visual features, further enhancing the discriminative power of the shot representation. Experimental results on the MovieNet benchmark dataset show that the proposed method achieves more accurate video scene segmentation.
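
To make the first-stage interaction and second-stage fusion concrete, the sketch below shows one plausible PyTorch realization: bidirectional cross-attention between shot-level visual and audio features, followed by a learned gate that produces a unified shot representation. All module names, dimensions, and the gating scheme are illustrative assumptions made here for clarity, not the authors' released implementation.

```python
# Illustrative sketch only: module names, dimensions, and the gated fusion
# are assumptions, not the paper's exact architecture.
import torch
import torch.nn as nn


class AudioVisualInteraction(nn.Module):
    """Bidirectional cross-attention between shot-level visual and audio features."""

    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        # Visual queries attend to audio keys/values, and vice versa.
        self.v_from_a = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.a_from_v = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_v = nn.LayerNorm(dim)
        self.norm_a = nn.LayerNorm(dim)

    def forward(self, vis: torch.Tensor, aud: torch.Tensor):
        # vis, aud: (batch, num_shots, dim) sequences of shot features.
        v_ctx, _ = self.v_from_a(vis, aud, aud)   # visual stream enriched by audio
        a_ctx, _ = self.a_from_v(aud, vis, vis)   # audio stream enriched by visual
        return self.norm_v(vis + v_ctx), self.norm_a(aud + a_ctx)


class GatedFusion(nn.Module):
    """Attention-style gate weighting each modality's contribution per shot."""

    def __init__(self, dim: int = 512):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, vis: torch.Tensor, aud: torch.Tensor):
        g = self.gate(torch.cat([vis, aud], dim=-1))  # per-dimension weight in [0, 1]
        return g * vis + (1.0 - g) * aud              # unified shot representation


if __name__ == "__main__":
    vis = torch.randn(2, 16, 512)   # 2 clips, 16 shots, 512-d visual features
    aud = torch.randn(2, 16, 512)   # matching audio features
    v, a = AudioVisualInteraction()(vis, aud)
    fused = GatedFusion()(v, a)
    print(fused.shape)              # torch.Size([2, 16, 512])
```

In such a design, the fused shot sequence would then feed a boundary classifier over consecutive shots, and a loss term on the visual branch (analogous to the paper's visual discrimination loss) could keep the fused representation from over-relying on either modality.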