Aligned attention for common multimodal embeddings
Shagan Sah, Sabarish Gopalakishnan, Raymond Ptucha
Abstract

Deep learning has been credited with incredible advances in computer vision, natural language processing, and general pattern understanding. Recent discoveries have enabled efficient vector representations of both visual and written stimuli. Robustly transferring between the two modalities remains a challenge that could yield benefits for search, retrieval, and storage applications. We introduce a simple yet highly effective approach for building a connection space in which natural language sentences are tightly coupled with visual data. In this connection space, similar concepts lie close together, whereas dissimilar concepts lie far apart, irrespective of their modality. We introduce an attention mechanism to align multimodal embeddings that are learned through a multimodal metric loss function. We evaluate the learned common vector space on multiple image–text datasets: Pascal Sentences, NUS-WIDE-10k, XMediaNet, Flowers, and Caltech-UCSD Birds. We further extend our method to five modalities (image, sentence, audio, video, and 3D model) to demonstrate cross-modal retrieval on the XMedia dataset. We obtain state-of-the-art retrieval and zero-shot retrieval performance across all datasets.
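
The abstract does not spell out the exact architecture or loss, so the following is only a minimal PyTorch sketch of the general idea it describes: attention-pooled encoders project each modality into a shared embedding space, and a bidirectional triplet (metric) loss pulls matching image–sentence pairs together while pushing mismatched pairs apart. All layer sizes, module names, and the margin value are illustrative assumptions, not the authors' implementation.

```python
# Sketch: attention-pooled encoders into a common embedding space,
# trained with a bidirectional triplet metric loss (illustrative only).
import torch
import torch.nn as nn
import torch.nn.functional as F


class AttentionPool(nn.Module):
    """Collapse a sequence of local features into one vector using
    learned attention weights over the sequence."""
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, feats):                  # feats: (batch, seq, dim)
        weights = F.softmax(self.score(feats), dim=1)
        return (weights * feats).sum(dim=1)    # (batch, dim)


class CommonSpaceEncoder(nn.Module):
    """Map modality-specific features (e.g., CNN regions or word vectors)
    into the shared embedding space."""
    def __init__(self, in_dim, embed_dim=512):
        super().__init__()
        self.pool = AttentionPool(in_dim)
        self.proj = nn.Linear(in_dim, embed_dim)

    def forward(self, feats):
        z = self.proj(self.pool(feats))
        return F.normalize(z, dim=-1)          # unit-norm embedding


def bidirectional_triplet_loss(img_z, txt_z, margin=0.2):
    """Hinge loss over cosine similarities: each image should be closer to
    its own sentence than to any other sentence in the batch, and vice versa."""
    sim = img_z @ txt_z.t()                    # (batch, batch) similarities
    pos = sim.diag().unsqueeze(1)              # matching-pair similarities
    cost_txt = (margin + sim - pos).clamp(min=0)      # image -> wrong texts
    cost_img = (margin + sim - pos.t()).clamp(min=0)  # text  -> wrong images
    mask = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    return cost_txt.masked_fill(mask, 0).mean() + cost_img.masked_fill(mask, 0).mean()


if __name__ == "__main__":
    # Toy batch: 8 images with 36 region features, 8 sentences with 20 word vectors.
    img_enc = CommonSpaceEncoder(in_dim=2048)
    txt_enc = CommonSpaceEncoder(in_dim=300)
    img_z = img_enc(torch.randn(8, 36, 2048))
    txt_z = txt_enc(torch.randn(8, 20, 300))
    print(bidirectional_triplet_loss(img_z, txt_z).item())
```

The same pattern extends to additional modalities (audio, video, 3D models) by adding one encoder per modality and applying the pairwise metric loss between each pair of modalities that share labels.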

© 2020 SPIE and IS&T | 1017-9909/2020/$28.00
Shagan Sah, Sabarish Gopalakishnan, and Raymond Ptucha "Aligned attention for common multimodal embeddings," Journal of Electronic Imaging 29(2), 023013 (25 March 2020). https://doi.org/10.1117/1.JEI.29.2.023013
Received: 13 August 2019; Accepted: 6 March 2020; Published: 25 March 2020
KEYWORDS
Data modeling
Image retrieval
3D modeling
Computer programming
Visualization
Video
3D image processing
