KEYWORDS: Visualization, Semantics, Optical character recognition, Education and training, Information visualization, Transformers, Data modeling, Object detection, Image enhancement, Buildings
Many existing image captioning methods focus only on image objects and their relationships when generating captions, ignoring the text present in an image. Scene text (ST) contains crucial information for understanding an image and facilitates reasoning. Existing methods fail to establish strong correlations between optical character recognition (OCR) tokens because they have limited OCR representation power, and they do not efficiently use the positional information of the text. In this work, we propose an ST-based image captioning model (Trans-MAtt) built on a multilevel attention mechanism and a relation network. We use relation networks to strengthen the connections between ST tokens and employ a multilevel attention method comprising spatial, semantic, and appearance attention modules that precisely describe the image. To represent context-enriched ST tokens, we combine appearance, location, FastText, and PHOC features. We predict the ST location in the image and integrate it with the generated word embeddings for final caption generation. Experiments on the TextCaps dataset demonstrate the effectiveness of the proposed Trans-MAtt model, which outperforms the current best model by 3.4% on B-4, 2.9% on METEOR, 3.3% on ROUGE-L, 3.1% on CIDEr-D, and 4.1% on SPICE. Our experiments on the Flickr30k and MSCOCO datasets further demonstrate the superiority of the proposed model over existing methods.
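To make the feature-fusion step concrete, the following is a minimal sketch (not the authors' implementation) of how appearance, location, FastText, and PHOC features for each OCR token could be projected into a single context-enriched embedding. All dimensions (2048-d appearance, 4-d bounding box, 300-d FastText, 604-d PHOC, 512-d output) and layer choices are illustrative assumptions.

```python
import torch
import torch.nn as nn

class OCRTokenEncoder(nn.Module):
    """Sketch: fuse the four OCR-token feature types into one embedding.
    Dimensions and the simple concatenate-and-project design are assumptions,
    not the Trans-MAtt paper's exact architecture."""
    def __init__(self, d_app=2048, d_loc=4, d_fasttext=300, d_phoc=604, d_model=512):
        super().__init__()
        self.proj = nn.Linear(d_app + d_loc + d_fasttext + d_phoc, d_model)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, app, loc, fasttext, phoc):
        # Concatenate per-token features along the channel axis, then project.
        fused = torch.cat([app, loc, fasttext, phoc], dim=-1)
        return self.norm(torch.relu(self.proj(fused)))

# Usage: a batch of 2 images with 10 OCR tokens each (random placeholders).
enc = OCRTokenEncoder()
tokens = enc(torch.randn(2, 10, 2048), torch.randn(2, 10, 4),
             torch.randn(2, 10, 300), torch.randn(2, 10, 604))
print(tokens.shape)  # torch.Size([2, 10, 512])
```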
With the remarkable success of image captioning tasks, visual attention methods have become a vital part of captioning models. However, most attention-based image captioning methods do not consider relationships among regions, which play a significant role in better image understanding. We propose an image captioning method based on a local relation network that uses a multilevel attention approach with a graph neural network. It not only fully explores the relationships between objects and image regions but also generates significant, context-based features for every region in the image. The attention employed in our work enhances the image representation capability of the method by focusing on a given image region and its related regions. Jointly modeling relevant contextual information, spatial locations, and deep visual features thus leads to improved caption generation (a minimal sketch of this attention step follows below). We verified the effectiveness of the proposed model through extensive experiments on three benchmark datasets: Flickr30k, MSCOCO, and nocaps. The results show the superiority of the proposed method over existing methods, both quantitatively and qualitatively. Detailed ablation studies are conducted to show how each component contributes to the final performance.
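The sketch below illustrates the general idea described in the abstract, not the paper's exact model: region features first exchange information with related regions (here a plain self-attention layer stands in for the local relation / graph step), and a query-guided attention then pools them into a single context vector for the decoder. All layer sizes and the use of `nn.MultiheadAttention` are assumptions.

```python
import torch
import torch.nn as nn

class RegionRelationAttention(nn.Module):
    """Sketch: relate regions to each other, then attend over them with a
    decoder-state query. Self-attention is used here as a simple stand-in
    for the relation/graph module described in the abstract."""
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        self.relation = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.score = nn.Linear(d_model, 1)

    def forward(self, regions, query):
        # regions: (B, N, d_model) region features; query: (B, d_model) decoder state.
        related, _ = self.relation(regions, regions, regions)  # region-to-region context
        alpha = torch.softmax(self.score(related + query.unsqueeze(1)), dim=1)
        return (alpha * related).sum(dim=1)  # (B, d_model) attended image context

# Usage: 2 images, 36 region features each, with a 512-d decoder query.
ctx = RegionRelationAttention()(torch.randn(2, 36, 512), torch.randn(2, 512))
print(ctx.shape)  # torch.Size([2, 512])
```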