Bidirectional LSTM approach to image captioning with scene features

Davis Agughalam; Pramod Pathak; Paul Stynes

doi:10.1117/12.2600465

30 June 2021

Proceedings Volume 11878, Thirteenth International Conference on Digital Image Processing (ICDIP 2021); 118780B (2021) https://doi.org/10.1117/12.2600465
Event: Thirteenth International Conference on Digital Image Processing, 2021, Singapore, Singapore

Abstract

Image captioning involves generating a sentence that describes an image. Recently, it has been driven by encoder-decoder approaches in which an encoder, such as a convolutional neural network (CNN), extracts the visual features of an image. The extracted visual features are passed to a decoder, such as a long short-term memory (LSTM) network, which generates a sentence that describes the image. One major challenge with this approach is precisely including the scene of an image in the generated sentences. To address this challenge, visual scene features have been used with unidirectional LSTM decoders. However, for long sentences, a unidirectional decoder limits the precision of the generated text. This research proposes a novel approach that generates sentences using visual scene information with a bidirectional LSTM decoder. The encoder uses Inception v3 to extract the object features and Places365 to extract the scene features. The decoder uses a bidirectional LSTM to generate a sentence. The encoder-decoder model is trained on the Flickr8k dataset. Results show improved performance for generating longer sentences, with a 9% increase in BLEU-3 and a 12% increase in BLEU-4 scores compared to other encoder-decoder methods that are limited to using only global image features. Visually impaired people who use screen readers would benefit from this research, as they would get an enhanced description of an image that includes the background scene, thereby creating a fuller picture in the mind of the reader.
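The abstract reports its gains in BLEU-3 and BLEU-4, which score a generated caption by its n-gram overlap with reference captions up to length 3 and 4. The following is a minimal sketch of sentence-level BLEU-n, assuming whitespace-tokenized captions, uniform n-gram weights, and no smoothing. The function names (`ngrams`, `modified_precision`, `bleu`) and the example captions are illustrative and not taken from the paper, whose exact evaluation tooling is not stated in the abstract.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count every n-gram (as a tuple) in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(candidate, references, n):
    """Clipped n-gram precision: a candidate n-gram is credited at most
    as many times as it appears in any single reference."""
    cand_counts = ngrams(candidate, n)
    if not cand_counts:
        return 0.0
    max_ref = Counter()
    for ref in references:
        for gram, count in ngrams(ref, n).items():
            max_ref[gram] = max(max_ref[gram], count)
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())


def bleu(candidate, references, max_n=4):
    """Sentence-level BLEU-max_n with uniform weights and no smoothing
    (an assumption of this sketch; evaluation toolkits often add smoothing)."""
    precisions = [modified_precision(candidate, references, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0  # without smoothing, one empty n-gram order zeroes the score
    log_mean = sum(math.log(p) for p in precisions) / max_n
    # Brevity penalty uses the reference length closest to the candidate length.
    c = len(candidate)
    r = min((abs(len(ref) - c), len(ref)) for ref in references)[1]
    brevity_penalty = 1.0 if c > r else math.exp(1 - r / c)
    return brevity_penalty * math.exp(log_mean)


# Hypothetical captions, for illustration only.
candidate = "a dog runs on the beach".split()
references = ["a dog runs on the sand".split()]
bleu3 = bleu(candidate, references, max_n=3)
bleu4 = bleu(candidate, references, max_n=4)
```

Because BLEU-3 and BLEU-4 require matching runs of three and four consecutive words, they are more sensitive than BLEU-1 to word order over longer spans, which is why they are the natural place to look for gains on longer sentences.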

Citation

Davis Agughalam, Pramod Pathak, and Paul Stynes "Bidirectional LSTM approach to image captioning with scene features", Proc. SPIE 11878, Thirteenth International Conference on Digital Image Processing (ICDIP 2021), 118780B (30 June 2021); https://doi.org/10.1117/12.2600465
