A Unified Approach to Voice Classification: Leveraging Spectrograms, Mel Spectrograms, and Statistical Features
This study presents a multi-input neural network architecture for voice classification that integrates two parallel convolutional neural networks (CNNs) for spectrogram and Mel spectrogram images, along with a fully connected dense network for six handpicked numerical statistical features from time...
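The statistical branch of the described model takes six numerical features computed from the raw waveform. This record does not name those six features, so the sketch below uses an assumed set (mean, standard deviation, RMS, skewness, kurtosis, zero-crossing rate) purely for illustration. The function name `time_domain_features` and the 8 kHz sample rate of the test tone are also hypothetical.

```python
import math


def time_domain_features(signal):
    """Return six illustrative statistics of a 1-D signal (list of floats).

    ASSUMPTION: the feature set is chosen for illustration only; the
    catalog record does not say which six features the authors used.
    """
    n = len(signal)
    if n < 2:
        raise ValueError("need at least two samples")
    mean = sum(signal) / n
    # Population variance and standard deviation.
    var = sum((x - mean) ** 2 for x in signal) / n
    std = math.sqrt(var)
    # Root-mean-square amplitude.
    rms = math.sqrt(sum(x * x for x in signal) / n)
    # Standardized third and fourth moments; defined as 0.0 for a constant signal.
    skew = sum((x - mean) ** 3 for x in signal) / (n * std ** 3) if std > 0 else 0.0
    kurt = sum((x - mean) ** 4 for x in signal) / (n * std ** 4) if std > 0 else 0.0
    # Fraction of adjacent sample pairs whose signs differ.
    crossings = sum(1 for a, b in zip(signal, signal[1:]) if (a >= 0) != (b >= 0))
    zcr = crossings / (n - 1)
    return [mean, std, rms, skew, kurt, zcr]


# One-second synthetic tone (440 Hz at an assumed 8000 Hz sample rate).
clip = [math.sin(2 * math.pi * 440 * t / 8000) for t in range(8000)]
print(time_domain_features(clip))
```

A vector like this would be the input to the dense branch, while the two image inputs go to the CNN branches.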
| Main Authors: | Muhammad Talha, Huma Ghafoor, Seung Yeob Nam |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | IEEE, 2025-01-01 |
| Series: | IEEE Access |
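| Subjects: | Voice classification; convolutional neural network (CNN); Mel spectrogram; spectrogram; statistical features |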
| Online Access: | https://ieeexplore.ieee.org/document/11098792/ |
| _version_ | 1849772786930679808 |
|---|---|
| author | Muhammad Talha; Huma Ghafoor; Seung Yeob Nam |
| collection | DOAJ |
| description | This study presents a multi-input neural network architecture for voice classification that integrates two parallel convolutional neural networks (CNNs) for spectrogram and Mel spectrogram images with a fully connected dense network for six handpicked numerical statistical features of the time-domain signal. The outputs of these branches are flattened and merged, enabling the model to learn complementary patterns from both visual and numerical modalities. A new dataset, voice-18, was developed, consisting of one-second audio clips from 18 speakers across 18 classes. Extensive experiments evaluated the performance of individual and combined inputs. The results show that the multi-input model achieves its highest accuracy when spectrograms, Mel spectrograms, and statistical features are used together: $0.9849 \pm 0.0093$ on voice-18, $0.8825 \pm 0.0137$ on UrbanSound8K (US8K), and $0.9220 \pm 0.0276$ on ESC-50 (environmental sound classification). Models trained on fewer than three inputs underperformed. These findings confirm the effectiveness of the proposed multimodal architecture for accurate voice and sound classification across different datasets. |
| format | Article |
| id | doaj-art-f3159bc1bedb42e1bf37d5aaa1fbefed |
| institution | DOAJ |
| issn | 2169-3536 |
| language | English |
| publishDate | 2025-01-01 |
| publisher | IEEE |
| record_format | Article |
| series | IEEE Access |
| spelling | doaj-art-f3159bc1bedb42e1bf37d5aaa1fbefed; 2025-08-20T03:02:14Z; eng; IEEE; IEEE Access; 2169-3536; 2025-01-01; 13; 133827; 133836; 10.1109/ACCESS.2025.3593440; 11098792; A Unified Approach to Voice Classification: Leveraging Spectrograms, Mel Spectrograms, and Statistical Features; Muhammad Talha (0); Huma Ghafoor (1), https://orcid.org/0000-0002-4640-4233; Seung Yeob Nam (2), https://orcid.org/0000-0001-8249-4742; (0) School of Electrical Engineering and Computer Science (SEECS), National University of Sciences and Technology (NUST), Islamabad, Pakistan; (1) School of Electrical Engineering and Computer Science (SEECS), National University of Sciences and Technology (NUST), Islamabad, Pakistan; (2) Department of Information and Communication Engineering, Yeungnam University, Gyeongsan, Gyeongsangbuk, South Korea; https://ieeexplore.ieee.org/document/11098792/; Voice classification; convolutional neural network (CNN); Mel spectrogram; spectrogram; statistical features |
| title | A Unified Approach to Voice Classification: Leveraging Spectrograms, Mel Spectrograms, and Statistical Features |
| topic | Voice classification; convolutional neural network (CNN); Mel spectrogram; spectrogram; statistical features |
| url | https://ieeexplore.ieee.org/document/11098792/ |
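The description above says the outputs of the three branches are flattened and merged before classification. A minimal sketch of that fusion step follows, assuming plain nested lists stand in for the CNN feature maps. The helper names `flatten`, `fuse`, and `dense_softmax`, the tiny tensor shapes, and the random weights are all hypothetical; the record gives no layer sizes or training details, so this is not the authors' implementation.

```python
import math
import random


def flatten(nested):
    """Flatten arbitrarily nested lists (e.g. channels x height x width)."""
    flat = []
    for item in nested:
        if isinstance(item, list):
            flat.extend(flatten(item))
        else:
            flat.append(float(item))
    return flat


def fuse(spec_maps, mel_maps, stat_vector):
    """Concatenate the flattened outputs of the three branches."""
    return flatten(spec_maps) + flatten(mel_maps) + flatten(stat_vector)


def dense_softmax(x, weights, biases):
    """One dense layer followed by a numerically stable softmax."""
    logits = [sum(w * v for w, v in zip(row, x)) + b
              for row, b in zip(weights, biases)]
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Placeholder branch outputs (ASSUMED shapes, not from the paper).
rng = random.Random(0)
spec_maps = [[[rng.random() for _ in range(2)] for _ in range(2)] for _ in range(2)]
mel_maps = [[[rng.random() for _ in range(2)] for _ in range(2)] for _ in range(2)]
stat_vector = [rng.random() for _ in range(6)]  # six statistical features

fused = fuse(spec_maps, mel_maps, stat_vector)  # 8 + 8 + 6 = 22 values

# Untrained classification head over 18 classes, as in voice-18.
num_classes = 18
weights = [[rng.uniform(-0.1, 0.1) for _ in fused] for _ in range(num_classes)]
biases = [0.0] * num_classes
probs = dense_softmax(fused, weights, biases)
print(len(fused), len(probs))
```

Concatenation keeps every branch's features visible to the classification head, which is one simple way to let a model combine image-derived and numerical inputs.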