DM-CATN: Deep Modular Co-Attention Transformer Networks for image captioning

Xingjian Wang; Xiaolong Fang; You Yang

doi:10.1117/12.2659580

30 November 2022

Proceedings Volume 12456, International Conference on Artificial Intelligence and Intelligent Information Processing (AIIIP 2022); 124562O (2022) https://doi.org/10.1117/12.2659580
Event: International Conference on Artificial Intelligence and Intelligent Information Processing (AIIIP 2022), 2022, Qingdao, China

Abstract

When image features are fed directly into the decoder of a captioning model, the feature information is underused and the model struggles to express the image content. We introduce a Modular Co-Attention Transformer Layer (M-CATL) that efficiently models high-order intra-feature and inter-feature interactions for single and multiple input features, in order to mine the details of the image features. We then build a Deep Modular Co-Attention Transformer Block (DM-CATB) from M-CATL and integrate it into the encoder of the model. Furthermore, we present a Deep Modular Co-Attention Transformer Network (DM-CATN) that fully models the spatial and positional information of image features and improves their representational power, providing richer image information to the decoder. Experimental results demonstrate that DM-CATN significantly outperforms the previous state of the art. Our best single model achieves 133.2% in CIDEr.
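
The abstract gives no layer-level details, so the following is only a minimal sketch of the general idea of intra-feature and inter-feature attention, not the authors' M-CATL. It assumes plain scaled dot-product attention with residual connections and omits learned projections, multiple heads, layer normalization, and feed-forward sublayers. The names `attention`, `co_attention_layer`, and `deep_block` are hypothetical, and the code is dependency-free Python for illustration only.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    d = len(keys[0])
    scores = matmul(queries, transpose(keys))
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return matmul(weights, values)

def add(a, b):
    """Element-wise sum, used here as a residual connection."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def co_attention_layer(x, y):
    """Hypothetical layer (assumption, not the paper's definition).

    Step 1: x attends to itself (intra-feature interaction).
    Step 2: x attends to a second feature set y (inter-feature interaction).
    """
    x = add(x, attention(x, x, x))
    x = add(x, attention(x, y, y))
    return x

def deep_block(x, y, depth):
    """Stack the layer `depth` times; the depth is an illustrative assumption."""
    for _ in range(depth):
        x = co_attention_layer(x, y)
    return x

# Toy inputs: three feature vectors in x, two in y, all of dimension 2.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
y = [[0.5, 0.5], [1.0, 0.0]]
encoded = deep_block(x, y, depth=2)
```

In this sketch the output keeps the shape of `x` (one refined vector per input feature), which is what would allow such a block to sit in an encoder ahead of a decoder.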

Citation

Xingjian Wang, Xiaolong Fang, and You Yang "DM-CATN: Deep Modular Co-Attention Transformer Networks for image captioning", Proc. SPIE 12456, International Conference on Artificial Intelligence and Intelligent Information Processing (AIIIP 2022), 124562O (30 November 2022); https://doi.org/10.1117/12.2659580

PROCEEDINGS
7 PAGES

KEYWORDS

Transformers

Visualization

Image fusion

Performance modeling

Computer programming

Matrices

Network architectures
