Research on Acoustic Scene Classification Based on Time–Frequency–Wavelet Fusion Network

Bibliographic Details
Main Authors: Fengzheng Bi, Lidong Yang
Format: Article
Language: English
Published: MDPI AG 2025-06-01
Series: Sensors
Subjects:
Online Access: https://www.mdpi.com/1424-8220/25/13/3930
author Fengzheng Bi
Lidong Yang
collection DOAJ
description Acoustic scene classification aims to recognize the scene corresponding to a sound signal in the environment, but differences in audio across cities and recording devices can degrade a model's accuracy. In this paper, a time–frequency–wavelet fusion network is proposed to improve performance through a time–frequency–wavelet module that focuses on three dimensions: the time axis of the spectrogram, the frequency axis, and the high- and low-frequency information extracted by a wavelet transform. Multidimensional information is fused through a gated temporal–spatial attention unit, and a visual state space module is introduced to enhance the contextual modeling of audio sequences. In addition, Kolmogorov–Arnold network (KAN) layers are used in place of multilayer perceptrons in the classifier. Experimental results show that the proposed method achieves 56.16% average accuracy on the TAU Urban Acoustic Scenes 2022 mobile development dataset, an improvement of 6.53 percentage points over the official baseline system, demonstrating the model's effectiveness in complex scenarios. The proposed method also reaches 97.60% accuracy on the UrbanSound8K dataset, significantly better than existing methods, further verifying the model's generalization ability in the acoustic scene classification task.
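The description above mentions a wavelet transform that separates a spectrogram into high- and low-frequency information. As a minimal, illustrative sketch of that idea (NumPy only; the function name and toy spectrogram are assumptions for demonstration, not the paper's actual time–frequency–wavelet module), a single-level Haar decomposition along the frequency axis splits the input into approximation (low-frequency) and detail (high-frequency) halves:

```python
import numpy as np

def haar_dwt(spec, axis=0):
    """Single-level Haar wavelet decomposition along one axis.

    Returns (low, high): the low-frequency approximation and
    high-frequency detail coefficients, each half the size of the
    input along `axis`. Hypothetical helper for illustration.
    """
    spec = np.moveaxis(spec, axis, 0)
    even, odd = spec[0::2], spec[1::2]
    low = (even + odd) / np.sqrt(2.0)   # approximation (low frequency)
    high = (even - odd) / np.sqrt(2.0)  # detail (high frequency)
    return np.moveaxis(low, 0, axis), np.moveaxis(high, 0, axis)

# Toy "mel spectrogram": 64 frequency bins x 100 time frames
rng = np.random.default_rng(0)
spec = rng.standard_normal((64, 100))
low, high = haar_dwt(spec, axis=0)
print(low.shape, high.shape)  # (32, 100) (32, 100)
```

Because the Haar basis is orthonormal, the split preserves signal energy and is exactly invertible (even rows = (low + high)/√2, odd rows = (low − high)/√2), so the two branches together lose no information relative to the original spectrogram.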
id doaj-art-6a7524a4726346ed9fc2852dddc58cb2
institution Kabale University
issn 1424-8220
doi 10.3390/s25133930
affiliation School of Digital and Intelligent Industry, Inner Mongolia University of Science and Technology, Baotou 014010, China (both authors)
title Research on Acoustic Scene Classification Based on Time–Frequency–Wavelet Fusion Network
topic acoustic scene classification
wavelet transform
visual state space
KAN
url https://www.mdpi.com/1424-8220/25/13/3930