Toubal, Imad Eddine; Chen, Yi-Ting; Viswanathan, Krishnamurthy; Salz, Daniel; Xia, Ye; Ding, Zhongli

Multi-Modal Dual-Tower Architectures for Entity Retrieval from Image and Text.

IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2023.

We explore different neural architectures for entity retrieval using multi-modal (image and text) input. Dual-tower architectures use two networks to embed the input and the target into a joint latent space. This is in contrast to single-tower architectures, which treat entity retrieval as a classification problem. We propose novel dual-tower multi-modal networks that use a shared tower to encode both modalities, in contrast to uni-modal architectures that use separate towers to encode images and text. We investigate the impact of using high-quality versus noisy text at training and test time on the performance of these models. We train our networks on a large weakly-labeled multi-modal dataset scraped from the public domain and evaluate on publicly available benchmarking datasets (namely COCO Captions, Open Images, and Wikipedia Image Text). Our findings suggest that adding high-quality text improves the performance of both single-tower and dual-tower architectures compared to using noisy text. Moreover, our experiments show that dual-tower architectures need no further training compared to single-tower architectures. Finally, we compare the proposed models to state-of-the-art architectures.
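To make the dual-tower idea concrete, the following is a minimal sketch of retrieval in a joint latent space, not the networks used in the paper. Every name here (`query_tower`, `entity_tower`, `retrieve`, `IDENTITY`, the toy feature vectors) is hypothetical, and each tower is reduced to a single linear projection followed by L2 normalization. The query tower is shared in the sense that image and text features pass through one network together; retrieval is a nearest-neighbor lookup by dot product over precomputed entity embeddings.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def linear(weights, v):
    """Apply a weight matrix (list of rows) to a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in weights]


def query_tower(weights, image_feats, text_feats):
    """Hypothetical shared tower: both modalities go through one network."""
    return l2_normalize(linear(weights, image_feats + text_feats))


def entity_tower(weights, entity_feats):
    """Hypothetical target tower: embeds an entity into the same space."""
    return l2_normalize(linear(weights, entity_feats))


def retrieve(query_emb, entity_embs, k=1):
    """Return the names of the k entities with the highest dot-product score."""
    scores = {
        name: sum(q * e for q, e in zip(query_emb, emb))
        for name, emb in entity_embs.items()
    }
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Toy weights (assumption): identity projections stand in for trained towers.
IDENTITY = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]

# Entity embeddings are computed once and stored in an index.
index = {
    "a": entity_tower(IDENTITY, [1.0, 0.0, 0.0, 0.0]),
    "b": entity_tower(IDENTITY, [0.0, 1.0, 0.0, 0.0]),
}

query = query_tower(IDENTITY, image_feats=[1.0, 0.0], text_feats=[0.0, 0.0])
print(retrieve(query, index))  # ['a']

# A new entity is added by embedding it; the query tower is left untouched.
index["c"] = entity_tower(IDENTITY, [0.0, 0.0, 1.0, 0.0])
query = query_tower(IDENTITY, image_feats=[0.0, 0.0], text_feats=[1.0, 0.0])
print(retrieve(query, index))  # ['c']
```

In this sketch the entity set is just the contents of `index`, so it can grow without changing either tower. A single-tower classifier, by contrast, would have one output unit per entity, fixing the entity set in its output layer.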

© 2023 Imad Toubal.