AVCaps: An Audio-Visual Dataset With Modality-Specific Captions

This paper introduces AVCaps, an audio-visual dataset that provides separate textual captions for the audio, visual, and audio-visual content of video clips. The dataset comprises 2061 video clips totaling 28.8 hours. We provide up to 5 captions for the audio, visual, and audio-visual...

Bibliographic Details
Main Authors: Parthasaarathy Sudarsanam, Irene Martin-Morato, Aapo Hakala, Tuomas Virtanen
Format: Article
Language: English
Published: IEEE, 2025-01-01
Series: IEEE Open Journal of Signal Processing
Subjects:
Online Access: https://ieeexplore.ieee.org/document/11029114/

Similar Items