w2v-SELD: A Sound Event Localization and Detection Framework for Self-Supervised Spatial Audio Pre-Training

Sound Event Localization and Detection (SELD) is a critical challenge in various industrial applications, such as autonomous systems, smart cities, and audio surveillance, which require accurate identification and localization of sound events in complex environments. Traditional supervised approaches heavily rely on large, annotated multichannel audio datasets, which are expensive and time-consuming to produce. This paper addresses this limitation by introducing the w2v-SELD architecture, a self-supervised model adapted from the wav2vec 2.0 framework to learn effective sound event representations directly from raw, unlabeled 3D audio data in ambisonics format. The proposed model follows a two-stage process: pre-training on large, unlabeled 3D audio datasets to capture high-level features, followed by fine-tuning with a smaller, labeled SELD dataset. Experimental results show that the w2v-SELD method outperforms baseline models on Detection and Classification of Acoustic Scenes and Events (DCASE) challenges, achieving a 66% improvement on DCASE TAU-2019 and a 57% improvement on DCASE TAU-2020 with respect to the baseline systems. The w2v-SELD model performs competitively with state-of-the-art supervised methods, highlighting its potential to significantly reduce the dependency on labeled data in industrial SELD applications. The code and pre-trained parameters of the w2v-SELD model are available online.
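The abstract describes a two-stage pipeline: a self-supervised encoder pre-trained on unlabeled ambisonics audio, then a fine-tuned head that performs the SELD task (per-frame event detection plus direction of arrival). The following is a purely illustrative NumPy sketch of that output structure, not the authors' implementation: the random-projection "encoder", the class count, frame count, and all function names are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

N_CLASSES = 14   # hypothetical number of sound-event classes
FEAT_DIM = 768   # wav2vec 2.0 Base context-vector dimension
N_FRAMES = 50    # feature frames for one audio clip

def pretrained_encoder(raw_frames):
    """Stand-in for the frozen self-supervised encoder (stage 1).
    Here just a fixed random projection of per-frame raw features."""
    W = rng.standard_normal((raw_frames.shape[-1], FEAT_DIM)) * 0.01
    return raw_frames @ W

def seld_head(features, W_sed, W_doa):
    """Fine-tuned SELD head (stage 2): per-frame class activity
    probabilities plus a 3D direction-of-arrival vector per class."""
    sed = 1.0 / (1.0 + np.exp(-(features @ W_sed)))  # sigmoid activity scores
    doa = features @ W_doa                           # xyz vector per class
    return sed, doa.reshape(len(features), N_CLASSES, 3)

raw = rng.standard_normal((N_FRAMES, 400))           # mock multichannel frames
feats = pretrained_encoder(raw)
W_sed = rng.standard_normal((FEAT_DIM, N_CLASSES)) * 0.01
W_doa = rng.standard_normal((FEAT_DIM, N_CLASSES * 3)) * 0.01
sed, doa = seld_head(feats, W_sed, W_doa)
print(sed.shape, doa.shape)  # (50, 14) and (50, 14, 3)
```

The split mirrors the abstract's design choice: only the small head (here `W_sed`, `W_doa`) needs labeled SELD data, while the encoder's weights come from unlabeled pre-training.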

Bibliographic Details
Main Authors: Orlem Lima Dos Santos, Karen Rosero, Bruno Masiero, Roberto de Alencar Lotufo
Format: Article
Language:English
Published: IEEE 2024-01-01
Series:IEEE Access
Subjects: Sound event localization and detection; self-supervised learning; spatial audio; wav2vec 2.0
Online Access:https://ieeexplore.ieee.org/document/10772471/
collection DOAJ
format Article
id doaj-art-3a4d9e8f878e43c28d86ff4a30d798e1
institution OA Journals
issn 2169-3536
language English
publishDate 2024-01-01
publisher IEEE
record_format Article
series IEEE Access
doi 10.1109/ACCESS.2024.3510453
volume 12, pages 181553-181569
Orlem Lima Dos Santos (ORCID: 0000-0002-3942-6418), Department of Computer Engineering and Industrial Automation, University of Campinas, Campinas, Brazil
Karen Rosero (ORCID: 0000-0002-8118-4213), Department of Electrical and Computer Engineering, The University of Texas at Dallas, Richardson, TX, USA
Bruno Masiero (ORCID: 0000-0002-2246-4450), Department of Computer Engineering and Industrial Automation, University of Campinas, Campinas, Brazil
Roberto de Alencar Lotufo (ORCID: 0000-0002-5652-0852), Department of Computer Engineering and Industrial Automation, University of Campinas, Campinas, Brazil
title w2v-SELD: A Sound Event Localization and Detection Framework for Self-Supervised Spatial Audio Pre-Training
topic Sound event localization and detection
self-supervised learning
spatial audio
wav2vec 2.0
url https://ieeexplore.ieee.org/document/10772471/