An Ensemble of Convolutional Neural Networks for Sound Event Detection
Sound event detection tasks are rapidly advancing in the field of pattern recognition, and deep learning methods are particularly well suited for such tasks. One of the important directions in this field is to detect the sounds of emotional events around residential buildings in smart cities and qui...
| Main Authors: | Abdinabi Mukhamadiyev, Ilyos Khujayarov, Dilorom Nabieva, Jinsoo Cho |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | MDPI AG 2025-05-01 |
| Series: | Mathematics |
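| Subjects: | smart city; sound event detection; audio signal; data augmentation; ensemble of classifiers; pattern recognition |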
| Online Access: | https://www.mdpi.com/2227-7390/13/9/1502 |
| _version_ | 1850032041211461632 |
|---|---|
| author | Abdinabi Mukhamadiyev; Ilyos Khujayarov; Dilorom Nabieva; Jinsoo Cho |
| author_facet | Abdinabi Mukhamadiyev; Ilyos Khujayarov; Dilorom Nabieva; Jinsoo Cho |
| author_sort | Abdinabi Mukhamadiyev |
| collection | DOAJ |
| description | Sound event detection is advancing rapidly within pattern recognition, and deep learning methods are particularly well suited to it. One important direction in this field is detecting the sounds of emotional events around residential buildings in smart cities so that the situation can be assessed quickly for security purposes. This research presents a comprehensive study of an ensemble convolutional recurrent neural network (CRNN) model designed for sound event detection (SED) in residential and public safety contexts. The work focuses on extracting meaningful features from audio signals using image-based representations, such as Discrete Cosine Transform (DCT) spectrograms, cochleagrams, and Mel spectrograms, to enhance robustness against noise and improve feature extraction. In collaboration with police officers, a two-hour dataset of 112 clips was prepared, covering four classes of emotional sounds: harassment, quarrels, screams, and breaking sounds. In addition to the crowdsourced dataset, publicly available datasets were used to broaden the study’s applicability. The combined dataset contains 5055 strongly labeled audio files of different lengths, totaling 14.14 h, in 13 separate sound categories. The proposed CRNN model integrates spatial and temporal feature extraction by processing these spectrograms through convolutional and bi-directional gated recurrent unit (GRU) layers. An ensemble approach combines predictions from three models, achieving F1 scores of 71.5% for segment-based metrics and 46% for event-based metrics. The results demonstrate the model’s effectiveness in detecting sound events under noisy conditions, even with a small, unbalanced dataset. This research highlights the potential of the model for real-time audio surveillance systems using mini-computers, offering cost-effective and accurate solutions for maintaining public order. |
| format | Article |
| id | doaj-art-2c5eea6e529e48849304c4aebbb71533 |
| institution | DOAJ |
| issn | 2227-7390 |
| language | English |
| publishDate | 2025-05-01 |
| publisher | MDPI AG |
| record_format | Article |
| series | Mathematics |
| spelling | doaj-art-2c5eea6e529e48849304c4aebbb71533; 2025-08-20T02:58:47Z; eng; MDPI AG; Mathematics; 2227-7390; 2025-05-01; 13; 9; 1502; 10.3390/math13091502; An Ensemble of Convolutional Neural Networks for Sound Event Detection; Abdinabi Mukhamadiyev (0); Ilyos Khujayarov (1); Dilorom Nabieva (2); Jinsoo Cho (3); Department of Computer Engineering, Gachon University, Sujeong-gu, Seongnam-si 13120, Republic of Korea; Department of Information Technologies, Samarkand Branch of Tashkent University of Information Technologies Named After Muhammad al-Khwarizmi, Tashkent 100084, Uzbekistan; Department of Information Technologies, Samarkand Branch of Tashkent University of Information Technologies Named After Muhammad al-Khwarizmi, Tashkent 100084, Uzbekistan; Department of Computer Engineering, Gachon University, Sujeong-gu, Seongnam-si 13120, Republic of Korea; https://www.mdpi.com/2227-7390/13/9/1502; smart city; sound event detection; audio signal; data augmentation; ensemble of classifiers; pattern recognition |
| spellingShingle | Abdinabi Mukhamadiyev; Ilyos Khujayarov; Dilorom Nabieva; Jinsoo Cho; An Ensemble of Convolutional Neural Networks for Sound Event Detection; Mathematics; smart city; sound event detection; audio signal; data augmentation; ensemble of classifiers; pattern recognition |
| title | An Ensemble of Convolutional Neural Networks for Sound Event Detection |
| title_full | An Ensemble of Convolutional Neural Networks for Sound Event Detection |
| title_fullStr | An Ensemble of Convolutional Neural Networks for Sound Event Detection |
| title_full_unstemmed | An Ensemble of Convolutional Neural Networks for Sound Event Detection |
| title_short | An Ensemble of Convolutional Neural Networks for Sound Event Detection |
| title_sort | ensemble of convolutional neural networks for sound event detection |
| topic | smart city; sound event detection; audio signal; data augmentation; ensemble of classifiers; pattern recognition |
| url | https://www.mdpi.com/2227-7390/13/9/1502 |
| work_keys_str_mv | AT abdinabimukhamadiyev anensembleofconvolutionalneuralnetworksforsoundeventdetection AT ilyoskhujayarov anensembleofconvolutionalneuralnetworksforsoundeventdetection AT diloromnabieva anensembleofconvolutionalneuralnetworksforsoundeventdetection AT jinsoocho anensembleofconvolutionalneuralnetworksforsoundeventdetection AT abdinabimukhamadiyev ensembleofconvolutionalneuralnetworksforsoundeventdetection AT ilyoskhujayarov ensembleofconvolutionalneuralnetworksforsoundeventdetection AT diloromnabieva ensembleofconvolutionalneuralnetworksforsoundeventdetection AT jinsoocho ensembleofconvolutionalneuralnetworksforsoundeventdetection |
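
The description above says that an ensemble combines predictions from three models and reports a segment-based F1 score, but the record does not state how the predictions are combined or how the metric is computed. The following is a minimal sketch under stated assumptions: the combination rule is a plain average of per-segment class probabilities, the decision threshold is 0.5, and the F1 score is micro-averaged over all segments and classes. The function names and all numbers are hypothetical illustrations, not the authors' implementation or data.

```python
from typing import List, Sequence

Matrix = Sequence[Sequence[float]]  # rows = time segments, columns = classes


def average_ensemble(model_probs: Sequence[Matrix]) -> List[List[float]]:
    """Average per-segment, per-class probabilities across models (assumed rule)."""
    n_models = len(model_probs)
    n_segments = len(model_probs[0])
    n_classes = len(model_probs[0][0])
    return [
        [sum(m[s][c] for m in model_probs) / n_models for c in range(n_classes)]
        for s in range(n_segments)
    ]


def binarize(probs: Matrix, threshold: float = 0.5) -> List[List[int]]:
    """Mark a class as active in a segment when its probability reaches the threshold."""
    return [[1 if p >= threshold else 0 for p in row] for row in probs]


def segment_f1(reference: Sequence[Sequence[int]],
               predicted: Sequence[Sequence[int]]) -> float:
    """Micro-averaged F1 over every (segment, class) cell."""
    tp = fp = fn = 0
    for ref_row, pred_row in zip(reference, predicted):
        for r, p in zip(ref_row, pred_row):
            if r and p:
                tp += 1
            elif p:
                fp += 1
            elif r:
                fn += 1
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Hypothetical outputs of three models (one per input representation)
# for 4 segments and 2 classes; the values are made up for illustration.
dct_model = [[0.9, 0.2], [0.8, 0.1], [0.3, 0.7], [0.2, 0.4]]
cochleagram_model = [[0.7, 0.1], [0.4, 0.3], [0.2, 0.9], [0.1, 0.6]]
mel_model = [[0.8, 0.3], [0.6, 0.2], [0.4, 0.5], [0.3, 0.8]]

ensemble_probs = average_ensemble([dct_model, cochleagram_model, mel_model])
decisions = binarize(ensemble_probs)

# Hypothetical strong labels for the same 4 segments.
reference = [[1, 0], [1, 0], [0, 1], [0, 0]]
score = segment_f1(reference, decisions)
print(decisions, round(score, 3))
```

Averaging probabilities before thresholding lets a confident model outvote an uncertain one; majority voting on binarized outputs would be an equally plausible reading of "combines predictions". An event-based score would need onset and offset matching and is not sketched here.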