MAD: A Scalable Dataset for Language Grounding in Videos from Movie Audio Descriptions

Mattia Soldan; Alejandro Pardo; Juan Leon Alcazar; Fabian Caba Heilbron; Chen Zhao; Silvio Giancola; Bernard Ghanem; IEEE COMP SOC

doi:10.1109/CVPR52688.2022.00497

Conference proceeding

2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2022), Vol.2022-, pp.5016-5025

IEEE Conference on Computer Vision and Pattern Recognition

01/01/2022

Abstract

Computer Science

Computer Science, Artificial Intelligence

Imaging Science & Photographic Technology

Science & Technology

Technology

The recent and increasing interest in video-language research has driven the development of large-scale datasets that enable data-intensive machine learning techniques. In comparison, limited effort has been made at assessing the fitness of these datasets for the video-language grounding task. Recent works have begun to discover significant limitations in these datasets, suggesting that state-of-the-art techniques commonly overfit to hidden dataset biases. In this work, we present MAD (Movie Audio Descriptions), a novel benchmark that departs from the paradigm of augmenting existing video datasets with text annotations and focuses on crawling and aligning available audio descriptions of mainstream movies. MAD contains over 384,000 natural language sentences grounded in over 1,200 hours of videos and exhibits a significant reduction in the currently diagnosed biases for video-language grounding datasets. MAD's collection strategy enables a novel and more challenging version of video-language grounding, where short temporal moments (typically seconds long) must be accurately grounded in diverse long-form videos that can last up to three hours. We have released MAD's data and baselines code at https://github.com/Soldelli/MAD.

Details

Title: MAD: A Scalable Dataset for Language Grounding in Videos from Movie Audio Descriptions
Creators - without role: Mattia Soldan - King Abdullah Univ Sci & Technol KAUST, Thuwal, Saudi Arabia
Alejandro Pardo - King Abdullah Univ Sci & Technol KAUST, Thuwal, Saudi Arabia
Juan Leon Alcazar - King Abdullah Univ Sci & Technol KAUST, Thuwal, Saudi Arabia
Fabian Caba Heilbron - Adobe Res, San Jose, CA USA
Chen Zhao - King Abdullah Univ Sci & Technol KAUST, Thuwal, Saudi Arabia
Silvio Giancola - King Abdullah Univ Sci & Technol KAUST, Thuwal, Saudi Arabia
Bernard Ghanem - King Abdullah Univ Sci & Technol KAUST, Thuwal, Saudi Arabia
IEEE COMP SOC
Publication Details: 2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2022), Vol.2022-, pp.5016-5025
Series: IEEE Conference on Computer Vision and Pattern Recognition
Publisher: IEEE
Number of pages: 10
Grant note: King Abdullah University of Science and Technology (KAUST) Office of Sponsored Research through the Visual Computing Center (VCC)
Identifiers: 9944223808331
Academic Unit: King Abdullah University of Science & Technology
Language: English
Resource Type: Conference proceeding