AVCaps: An Audio-Visual Dataset With Modality-Specific Captions
This paper introduces AVCaps, an audio-visual dataset that contains separate textual captions for the audio, visual, and audio-visual contents of video clips. The dataset contains 2061 video clips constituting a total of 28.8 hours. We provide up to 5 captions for the audio, visual, and audio-visual...
| Main Authors: | Parthasaarathy Sudarsanam, Irene Martin-Morato, Aapo Hakala, Tuomas Virtanen |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | IEEE, 2025-01-01 |
| Series: | IEEE Open Journal of Signal Processing |
| Subjects: | |
| Online Access: | https://ieeexplore.ieee.org/document/11029114/ |
Similar Items
- Tiny TR-CAP: A novel small-scale benchmark dataset for general-purpose image captioning tasks
  by: Abbas Memiş, et al.
  Published: (2025-04-01)
- Bi-Modal Multiperspective Percussive (BiMP) Dataset for Visual and Audio Human Fall Detection
  by: Joe Dibble, et al.
  Published: (2025-01-01)
- Survey of Dense Video Captioning: Techniques, Resources, and Future Perspectives
  by: Zhandong Liu, et al.
  Published: (2025-04-01)
- DanceCaps: Pseudo-Captioning for Dance Videos Using Large Language Models
  by: Seohyun Kim, et al.
  Published: (2024-11-01)
- MusiQAl: A Dataset for Music Question–Answering through Audio–Video Fusion
  by: Anna-Maria Christodoulou, et al.
  Published: (2025-07-01)