w2v-SELD: A Sound Event Localization and Detection Framework for Self-Supervised Spatial Audio Pre-Training
Sound Event Localization and Detection (SELD) is a critical challenge in various industrial applications, such as autonomous systems, smart cities, and audio surveillance, which require accurate identification and localization of sound events in complex environments. Traditional supervised approaches heavily rely on large, annotated multichannel audio datasets, which are expensive and time-consuming to produce. This paper addresses this limitation by introducing the w2v-SELD architecture, a self-supervised model adapted from the wav2vec 2.0 framework to learn effective sound event representations directly from raw, unlabeled 3D audio data in ambisonics format. The proposed model follows a two-stage process: pre-training on large, unlabeled 3D audio datasets to capture high-level features, followed by fine-tuning with a smaller, labeled SELD dataset. Experimental results show that our w2v-SELD method outperforms baseline models on Detection and Classification of Acoustic Scenes and Events (DCASE) challenges, achieving a 66% improvement for DCASE TAU-2019 and a 57% improvement on DCASE TAU-2020 with respect to baseline systems. The w2v-SELD model performs competitively with state-of-the-art supervised methods, highlighting its potential to significantly reduce the dependency on labeled data in industrial SELD applications. The code and pre-trained parameters of our w2v-SELD model are available online.
| Main Authors: | Orlem Lima Dos Santos, Karen Rosero, Bruno Masiero, Roberto de Alencar Lotufo |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | IEEE, 2024-01-01 |
| Series: | IEEE Access |
| Subjects: | Sound event localization and detection; self-supervised learning; spatial audio; wav2vec 2.0 |
| Online Access: | https://ieeexplore.ieee.org/document/10772471/ |
| _version_ | 1850165451941740544 |
|---|---|
| author | Orlem Lima Dos Santos; Karen Rosero; Bruno Masiero; Roberto de Alencar Lotufo |
| author_facet | Orlem Lima Dos Santos; Karen Rosero; Bruno Masiero; Roberto de Alencar Lotufo |
| author_sort | Orlem Lima Dos Santos |
| collection | DOAJ |
| description | Sound Event Localization and Detection (SELD) is a critical challenge in various industrial applications, such as autonomous systems, smart cities, and audio surveillance, which require accurate identification and localization of sound events in complex environments. Traditional supervised approaches heavily rely on large, annotated multichannel audio datasets, which are expensive and time-consuming to produce. This paper addresses this limitation by introducing the w2v-SELD architecture, a self-supervised model adapted from the wav2vec 2.0 framework to learn effective sound event representations directly from raw, unlabeled 3D audio data in ambisonics format. The proposed model follows a two-stage process: pre-training on large, unlabeled 3D audio datasets to capture high-level features, followed by fine-tuning with a smaller, labeled SELD dataset. Experimental results show that our w2v-SELD method outperforms baseline models on Detection and Classification of Acoustic Scenes and Events (DCASE) challenges, achieving a 66% improvement for DCASE TAU-2019 and a 57% improvement on DCASE TAU-2020 with respect to baseline systems. The w2v-SELD model performs competitively with state-of-the-art supervised methods, highlighting its potential to significantly reduce the dependency on labeled data in industrial SELD applications. The code and pre-trained parameters of our w2v-SELD model are available online. |
| format | Article |
| id | doaj-art-3a4d9e8f878e43c28d86ff4a30d798e1 |
| institution | OA Journals |
| issn | 2169-3536 |
| language | English |
| publishDate | 2024-01-01 |
| publisher | IEEE |
| record_format | Article |
| series | IEEE Access |
| spelling | doaj-art-3a4d9e8f878e43c28d86ff4a30d798e1 (2025-08-20T02:21:45Z); eng; IEEE; IEEE Access; ISSN 2169-3536; 2024-01-01; vol. 12, pp. 181553-181569; DOI 10.1109/ACCESS.2024.3510453; IEEE document 10772471; w2v-SELD: A Sound Event Localization and Detection Framework for Self-Supervised Spatial Audio Pre-Training; Orlem Lima Dos Santos (https://orcid.org/0000-0002-3942-6418; Department of Computer Engineering and Industrial Automation, University of Campinas, Campinas, Brazil); Karen Rosero (https://orcid.org/0000-0002-8118-4213; Department of Electrical and Computer Engineering, The University of Texas at Dallas, Richardson, TX, USA); Bruno Masiero (https://orcid.org/0000-0002-2246-4450; Department of Computer Engineering and Industrial Automation, University of Campinas, Campinas, Brazil); Roberto de Alencar Lotufo (https://orcid.org/0000-0002-5652-0852; Department of Computer Engineering and Industrial Automation, University of Campinas, Campinas, Brazil); abstract as in the description field above; https://ieeexplore.ieee.org/document/10772471/; keywords: Sound event localization and detection; self-supervised learning; spatial audio; wav2vec 2.0 |
| spellingShingle | Orlem Lima Dos Santos; Karen Rosero; Bruno Masiero; Roberto de Alencar Lotufo; w2v-SELD: A Sound Event Localization and Detection Framework for Self-Supervised Spatial Audio Pre-Training; IEEE Access; Sound event localization and detection; self-supervised learning; spatial audio; wav2vec 2.0 |
| title | w2v-SELD: A Sound Event Localization and Detection Framework for Self-Supervised Spatial Audio Pre-Training |
| title_full | w2v-SELD: A Sound Event Localization and Detection Framework for Self-Supervised Spatial Audio Pre-Training |
| title_fullStr | w2v-SELD: A Sound Event Localization and Detection Framework for Self-Supervised Spatial Audio Pre-Training |
| title_full_unstemmed | w2v-SELD: A Sound Event Localization and Detection Framework for Self-Supervised Spatial Audio Pre-Training |
| title_short | w2v-SELD: A Sound Event Localization and Detection Framework for Self-Supervised Spatial Audio Pre-Training |
| title_sort | w2v seld a sound event localization and detection framework for self supervised spatial audio pre training |
| topic | Sound event localization and detection; self-supervised learning; spatial audio; wav2vec 2.0 |
| url | https://ieeexplore.ieee.org/document/10772471/ |
| work_keys_str_mv | AT orlemlimadossantos w2vseldasoundeventlocalizationanddetectionframeworkforselfsupervisedspatialaudiopretraining AT karenrosero w2vseldasoundeventlocalizationanddetectionframeworkforselfsupervisedspatialaudiopretraining AT brunomasiero w2vseldasoundeventlocalizationanddetectionframeworkforselfsupervisedspatialaudiopretraining AT robertodealencarlotufo w2vseldasoundeventlocalizationanddetectionframeworkforselfsupervisedspatialaudiopretraining |