MAIN: Multi-Attention Instance Network for video segmentation

Juan León Alcázar; María A. Bravo; Guillaume Jeanneret; Ali K. Thabet; Thomas Brox; Pablo Arbeláez; Bernard Ghanem

doi:10.1016/j.cviu.2021.103240

Journal article

Peer reviewed

Computer Vision and Image Understanding, Vol. 210, p. 103240

09/2021

DOI: https://doi.org/10.1016/j.cviu.2021.103240

Abstract

Instance-level video segmentation requires a solid integration of spatial and temporal information. However, current methods rely mostly on domain-specific information (online learning) to produce accurate instance-level segmentations. We propose a novel approach that relies exclusively on the integration of generic spatio-temporal attention cues. Our strategy, named Multi-Attention Instance Network (MAIN), overcomes challenging segmentation scenarios over arbitrary videos without modeling sequence- or instance-specific knowledge. We design MAIN to segment multiple instances in a single forward pass, and optimize it with a novel loss function that favors class-agnostic predictions and assigns instance-specific penalties. We achieve state-of-the-art performance on the challenging YouTube-VOS dataset and benchmark, improving the Jaccard and F-Metric on unseen categories by 6.8% and 12.7%, respectively, while operating in real time (30.3 FPS).

• MAIN directly addresses the multi-instance scenario.
• MAIN generates multi-instance segmentations in a single forward pass.
• MAIN operates without domain-specific knowledge.
• We introduce a novel loss function for the multi-instance segmentation scenario.
• Our loss function addresses imbalanced datasets that contain multiple instances.
• MAIN integrates static and temporal grouping cues into a single architecture.
• MAIN fuses short-term and long-term grouping cues into a single architecture.
• A dilated separable decoder efficiently aggregates multi-scale information.

Keywords: Attention mechanism; Deep learning; Video object segmentation
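For readers unfamiliar with the Jaccard metric named in the abstract, the following is a minimal sketch of the standard intersection-over-union computation on a single pair of binary masks. It is not taken from the article: the function name `jaccard_index`, the toy masks, and the convention of scoring two empty masks as 1.0 are assumptions made for illustration, and the benchmark's own averaging over frames, instances, and categories is not reproduced here.

```python
import numpy as np


def jaccard_index(pred_mask, gt_mask):
    """Intersection over union of two binary masks (hypothetical helper)."""
    pred = np.asarray(pred_mask, dtype=bool)
    gt = np.asarray(gt_mask, dtype=bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        # Assumption: two empty masks are treated as a perfect match.
        return 1.0
    intersection = np.logical_and(pred, gt).sum()
    return float(intersection) / float(union)


# Toy 2x3 masks: 2 pixels overlap, 4 pixels are covered by either mask.
pred = np.array([[1, 1, 0],
                 [0, 1, 0]])
gt = np.array([[1, 0, 0],
               [0, 1, 1]])
score = jaccard_index(pred, gt)
print(score)
```

A score of 1.0 means the predicted mask matches the ground truth exactly, and 0.0 means the two masks do not overlap at all.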

Details

Title: MAIN: Multi-Attention Instance Network for video segmentation
Creators:
Juan León Alcázar - King Abdullah University of Science and Technology
María A. Bravo - University of Freiburg
Guillaume Jeanneret - Universidad de los Andes, Bogotá, Colombia
Ali K. Thabet - King Abdullah University of Science and Technology
Thomas Brox - University of Freiburg
Pablo Arbeláez - Universidad de los Andes
Bernard Ghanem - King Abdullah University of Science and Technology
Publication Details: Computer Vision and Image Understanding, Vol. 210, p. 103240
Publisher: Elsevier Inc
Identifiers: 9945827908331
Academic Unit: King Abdullah University of Science & Technology
Language: English
Resource Type: Journal article