Abstract
Recently, multimodal retrieval has attracted increasing attention in the remote sensing community. In particular, text-image retrieval has emerged as a promising research topic because it enables a flexible retrieval experience. To this end, we propose a transformer-based multilingual text-image retrieval approach. Specifically, we employ a transformer encoder for both the textual and visual modalities. In the text encoder, we jointly train on four languages: English, Arabic, French, and Italian. We conduct experiments on the fine-grained multimodal dataset RSITMD. The experiments demonstrate that the proposed method outperforms state-of-the-art methods in both single-language and multilingual settings.