Abstract
With the rapid growth of multimodal remote sensing (RS) data, cross-modal retrieval offers greater flexibility in image retrieval tasks. However, image retrieval across different modalities remains an open challenge in the RS community. Inspired by the recent success of transformers in natural language processing and computer vision applications, we present a transformer-based method for text-image retrieval, which consists of separate encoders for textual and visual features. Specifically, we adopt Arabic and English captions for the text modality. We then investigate two paradigms. In the first paradigm, we learn each language independently. In the second paradigm, we learn the Arabic and English languages jointly. The experimental results on two cross-modal datasets confirm the promising capabilities of the proposed method.