Bi-Modal Transformer-Based Approach for Visual Question Answering in Remote Sensing Imagery

Yakoub Bazi; Mohamad Mahmoud Al Rahhal; Mohamed Lamine Mekhalfi; Mansour Abdulaziz Al Zuair; Farid Melgani

doi:10.1109/TGRS.2022.3192460

Journal article

Peer reviewed

Yakoub Bazi, Mohamad Mahmoud Al Rahhal, Mohamed Lamine Mekhalfi, Mansour Abdulaziz Al Zuair and Farid Melgani

IEEE Transactions on Geoscience and Remote Sensing, Vol. 60, pp. 1-11

2022

DOI: https://doi.org/10.1109/TGRS.2022.3192460

Keywords: Co-attention; Computer vision; Feature extraction; Head; Remote sensing; self-attention; Task analysis; Transformers; vision-language models; visual question answering (VQA); Visualization

Abstract

Recently, transformer-based vision-language models have been gaining popularity for the joint modeling of visual and textual modalities. In particular, they show impressive results when transferred to several downstream tasks such as zero-shot and few-shot classification. In this article, we propose a visual question answering (VQA) approach for remote sensing images based on these models. The VQA task attempts to provide answers to image-related questions. Although VQA has gained popularity in computer vision, it is not widespread in remote sensing. First, we use the contrastive language-image pretraining (CLIP) network to embed the image patches and question words into sequences of visual and textual representations. Then, we learn attention mechanisms to capture the intradependencies within, and the interdependencies between, these representations. Afterward, we generate the final answer by averaging the predictions of two classifiers mounted on top of the resulting contextual representations. In the experiments, we study the performance of the proposed approach on two datasets acquired with Sentinel-2 and aerial sensors. In particular, we demonstrate that our approach can achieve better results with a reduced training set size compared with the recent state of the art.
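
The pipeline summarized in the abstract (token embeddings, attention within and between the two modalities, then two averaged classifiers) can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the token vectors are toy stand-ins for CLIP outputs, the attention has no learned projections, and the helper names (`attend`, `mean_pool`, `linear`, `average_predictions`) and the head weights are hypothetical. Mounting one classifier head per modality is also an assumption; the abstract only says the two classifiers sit on top of the contextual representations.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attend(queries, keys, values):
    """Scaled dot-product attention without learned projections."""
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out

def mean_pool(tokens):
    """Average a token sequence into a single vector."""
    n = len(tokens)
    return [sum(t[j] for t in tokens) / n for j in range(len(tokens[0]))]

def linear(weights, x):
    """Apply a weight matrix (one row per class) to a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]

def average_predictions(logits_a, logits_b):
    """Average the class probabilities produced by two classifier heads."""
    pa, pb = softmax(logits_a), softmax(logits_b)
    return [(a + b) / 2 for a, b in zip(pa, pb)]

# Toy stand-ins for CLIP embeddings (hypothetical values, not real features).
visual = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # 3 image-patch tokens
textual = [[0.5, 0.5], [1.0, 0.0]]              # 2 question-word tokens

# Self-attention: intradependencies within each modality.
visual_ctx = attend(visual, visual, visual)
textual_ctx = attend(textual, textual, textual)

# Co-attention: interdependencies between the modalities.
visual_co = attend(visual_ctx, textual_ctx, textual_ctx)
textual_co = attend(textual_ctx, visual_ctx, visual_ctx)

# Two hypothetical 3-class heads, one per modality (assumed placement).
head_visual = [[1.0, -1.0], [-1.0, 1.0], [0.5, 0.5]]
head_textual = [[0.5, 0.0], [0.0, 0.5], [-0.5, -0.5]]

probs = average_predictions(
    linear(head_visual, mean_pool(visual_co)),
    linear(head_textual, mean_pool(textual_co)),
)
answer = max(range(len(probs)), key=probs.__getitem__)  # index of the predicted answer
```

Averaging probabilities rather than raw logits keeps the two heads on a common scale, so neither head dominates simply because its logits are larger in magnitude.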

Details

Title: Bi-Modal Transformer-Based Approach for Visual Question Answering in Remote Sensing Imagery
Creators: Yakoub Bazi (King Saud University); Mohamad Mahmoud Al Rahhal (King Saud University); Mohamed Lamine Mekhalfi (Fondazione Bruno Kessler); Mansour Abdulaziz Al Zuair (King Saud University); Farid Melgani (University of Trento)
Publication Details: IEEE Transactions on Geoscience and Remote Sensing, Vol. 60, pp. 1-11
Publisher: IEEE
Grant note: Research Center of College of Computer and Information Sciences, Deanship of Scientific Research, King Saud University, Riyadh, Saudi Arabia (10.13039/501100004745)
Identifiers: 9949028308331
Academic Unit: King Saud University
Language: English
Resource Type: Journal article