Video captioning using weakly supervised convolutional neural networks
20 August 2020
Abstract
The video captioning problem consists of describing a short video clip with natural language. Existing solutions tend to rely on extracting features from frames or sets of frames with pretrained and fixed Convolutional Neural Networks (CNNs). Traditionally, the CNNs are pretrained on the ImageNet-1K (IN1K) classification task. The features are then fed into a sequence-to-sequence model to produce the text description output. In this paper, we propose using Facebook's ResNeXt Weakly Supervised Learning (WSL) CNNs as fixed feature extractors for video captioning. These CNNs are trained on billion-scale weakly supervised datasets constructed from Instagram image-hashtag pairs and then fine-tuned on IN1K. Whereas previous works use complicated architectures or multimodal features, we demonstrate state-of-the-art performance on the Microsoft Video Description (MSVD) dataset and competitive results on the Microsoft Research-Video to Text (MSR-VTT) dataset using only the frame-level features from the new CNNs and a basic Transformer as a sequence-to-sequence model. Moreover, our results validate that CNNs pretrained with weak supervision can effectively transfer to tasks other than classification. Finally, we present results for a number of IN1K feature extractors and discuss the relationship between IN1K accuracy and video captioning performance. Code will be made available at https://github.com/flauted/OpenNMT-py.
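The frame-level pipeline summarized in the abstract can be pictured with a minimal, dependency-free sketch. It is not the authors' implementation: `make_fixed_extractor` and `video_to_feature_sequence` are hypothetical names, and a seeded random linear projection stands in for the pretrained, frozen CNN. The sketch only shows how a clip becomes a sequence of per-frame feature vectors, which a sequence-to-sequence model such as a Transformer would then take as its source input.

```python
import random
from typing import Callable, List, Sequence

Frame = List[List[float]]   # hypothetical toy frame: rows of pixel intensities
Feature = List[float]       # one feature vector per frame


def make_fixed_extractor(in_dim: int, feat_dim: int, seed: int = 0) -> Callable[[Frame], Feature]:
    """Build a stand-in for a pretrained, frozen feature extractor.

    Assumption: a seeded random linear projection plays the role of the CNN.
    The weights are created once here and never updated afterwards.
    """
    rng = random.Random(seed)
    weights = [[rng.uniform(-1.0, 1.0) for _ in range(in_dim)] for _ in range(feat_dim)]

    def extract(frame: Frame) -> Feature:
        pixels = [p for row in frame for p in row]  # flatten the frame
        if len(pixels) != in_dim:
            raise ValueError(f"expected {in_dim} pixels, got {len(pixels)}")
        # One dot product per output feature.
        return [sum(w * p for w, p in zip(w_row, pixels)) for w_row in weights]

    return extract


def video_to_feature_sequence(frames: Sequence[Frame],
                              extractor: Callable[[Frame], Feature]) -> List[Feature]:
    """Apply the fixed extractor to every frame, preserving temporal order."""
    return [extractor(frame) for frame in frames]


# Toy clip: 5 frames of 2x2 "pixels" (illustrative values only).
clip = [[[float(t), 0.5], [0.25, 1.0]] for t in range(5)]
extractor = make_fixed_extractor(in_dim=4, feat_dim=3)

# Sequence of 5 feature vectors; this is what the captioning model would consume.
source_sequence = video_to_feature_sequence(clip, extractor)
```

Because the extractor is fixed, the features for a dataset can be computed once and cached, so only the sequence-to-sequence model needs training.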
© (2020) COPYRIGHT Society of Photo-Optical Instrumentation Engineers (SPIE). Downloading of the abstract is permitted for personal use only.
Dylan Flaute and Barath Narayanan Narayanan "Video captioning using weakly supervised convolutional neural networks", Proc. SPIE 11511, Applications of Machine Learning 2020, 1151106 (20 August 2020); https://doi.org/10.1117/12.2568016
KEYWORDS
Video
Transformers
Convolutional neural networks
Video processing
Computer programming
