AVCaps: An Audio-Visual Dataset With Modality-Specific Captions

This paper introduces AVCaps, an audio-visual dataset that provides separate textual captions for the audio, visual, and audio-visual content of video clips. The dataset comprises 2061 video clips totaling 28.8 hours. We provide up to 5 captions for the audio, visual, and audio-visual...

Bibliographic Details
Main Authors: Parthasaarathy Sudarsanam, Irene Martin-Morato, Aapo Hakala, Tuomas Virtanen
Format: Article
Language: English
Published: IEEE, 2025-01-01
Series: IEEE Open Journal of Signal Processing
Subjects:
Online Access: https://ieeexplore.ieee.org/document/11029114/

Similar Items