Abstract
In this paper, we present an open-ended visual question answering (VQA) model for remote sensing images, whose answers take the form of short sentences, unlike in closed-ended VQA. The model uses vision and natural language transformers to embed the image and its related question. The resulting feature representations are concatenated and fed to a lightweight transformer decoder, which generates the answer autoregressively. The complete architecture is trained end-to-end via backpropagation. In the experiments, we evaluate the model on TextRS, a manually labeled open-ended VQA dataset composed of 6245 image-question pairs.
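
The data flow described above (embed, concatenate, decode token by token) can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the token ids `BOS` and `EOS`, the function names, the toy feature values, and the `toy_scores` stand-in for the transformer decoder are all assumptions made for illustration, and greedy selection is only one possible decoding strategy.

```python
from typing import Callable, List, Sequence

BOS, EOS = 0, 1  # hypothetical start/end-of-answer token ids

Features = List[List[float]]  # one feature vector per token


def concat_features(image_feats: Features, question_feats: Features) -> Features:
    """Concatenate image and question tokens along the sequence axis."""
    return list(image_feats) + list(question_feats)


def greedy_decode(
    memory: Features,
    next_token_scores: Callable[[Features, Sequence[int]], List[float]],
    max_len: int = 20,
) -> List[int]:
    """Generate answer tokens one at a time, each conditioned on the prefix."""
    tokens = [BOS]
    for _ in range(max_len):
        scores = next_token_scores(memory, tokens)
        # Greedy choice: pick the highest-scoring vocabulary entry.
        next_id = max(range(len(scores)), key=scores.__getitem__)
        tokens.append(next_id)
        if next_id == EOS:
            break
    return tokens[1:]  # drop the BOS marker


def toy_scores(memory: Features, prefix: Sequence[int]) -> List[float]:
    """Stand-in for the decoder: always spells out a fixed toy answer."""
    target = [3, 4, 2, EOS]
    step = len(prefix) - 1
    wanted = target[step] if step < len(target) else EOS
    return [1.0 if i == wanted else 0.0 for i in range(5)]


# Toy inputs: two image-patch tokens and one question token, each of width 2.
image_feats = [[0.1, 0.2], [0.3, 0.4]]
question_feats = [[0.5, 0.6]]
memory = concat_features(image_feats, question_feats)
answer_ids = greedy_decode(memory, toy_scores)
```

In a real system, `toy_scores` would be replaced by a forward pass of the trained decoder over `memory` and the tokens generated so far, and the resulting ids would be mapped back to words by the tokenizer.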