A Unified Approach to Voice Classification: Leveraging Spectrograms, Mel Spectrograms, and Statistical Features

This study presents a multi-input neural network architecture for voice classification that integrates two parallel convolutional neural networks (CNNs) for spectrogram and Mel spectrogram images, along with a fully connected dense network for six handpicked numerical statistical features from time...

Bibliographic Details
Main Authors: Muhammad Talha, Huma Ghafoor, Seung Yeob Nam
Format: Article
Language: English
Published: IEEE 2025-01-01
Series: IEEE Access
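Subjects: Voice classification; convolutional neural network (CNN); Mel spectrogram; spectrogram; statistical features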
Online Access: https://ieeexplore.ieee.org/document/11098792/
_version_ 1849772786930679808
author Muhammad Talha
Huma Ghafoor
Seung Yeob Nam
collection DOAJ
description This study presents a multi-input neural network architecture for voice classification that integrates two parallel convolutional neural networks (CNNs) for spectrogram and Mel spectrogram images with a fully connected dense network for six handpicked numerical statistical features from the time-domain signal. The outputs of these branches are flattened and merged, enabling the model to learn complementary patterns from both visual and numerical modalities. A new dataset, voice-18, was developed, consisting of one-second audio clips from 18 speakers across 18 classes. Extensive experiments evaluated the performance of individual and combined inputs. Results demonstrate that the multi-input model achieves the highest accuracy when spectrograms, Mel spectrograms, and statistical features are used together. With all three inputs, the model attained accuracies of $0.9849 \pm 0.0093$ on voice-18, $0.8825 \pm 0.0137$ on UrbanSound8K (US8K), and $0.9220 \pm 0.0276$ on ESC-50 (environmental sound classification), whereas models trained on fewer than three inputs underperformed. These findings confirm the effectiveness of the proposed multimodal architecture for accurate voice and sound classification across different datasets.
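The fusion step described in the abstract (flatten each branch output, then merge) can be illustrated with a minimal standard-library sketch. Everything concrete here is an assumption made for illustration: the record does not name the six statistical features, so the six computed below (mean, standard deviation, RMS, peak amplitude, zero-crossing rate, skewness) are a hypothetical choice, the 8 kHz sample rate and the 440 Hz test tone are arbitrary, and the two small zero-filled grids merely stand in for the outputs of the spectrogram and Mel spectrogram CNN branches.

```python
import math
import statistics


def statistical_features(signal):
    """Return six summary statistics of a time-domain signal.

    The choice of statistics is hypothetical; the source record does not
    list which six features the authors used.
    """
    n = len(signal)
    mean = statistics.fmean(signal)
    std = statistics.pstdev(signal)
    rms = math.sqrt(sum(x * x for x in signal) / n)
    peak = max(abs(x) for x in signal)
    # Zero-crossing rate: fraction of adjacent sample pairs whose signs differ.
    zcr = sum(1 for a, b in zip(signal, signal[1:]) if (a >= 0) != (b >= 0)) / (n - 1)
    # Population skewness; defined as 0.0 for a constant signal.
    skew = sum((x - mean) ** 3 for x in signal) / (n * std ** 3) if std > 0 else 0.0
    return [mean, std, rms, peak, zcr, skew]


def flatten(feature_map):
    """Flatten a 2-D feature map (list of rows) into a 1-D list."""
    return [value for row in feature_map for value in row]


def merge_branches(spec_map, mel_map, stats):
    """Late fusion: flatten each branch output and concatenate the results."""
    return flatten(spec_map) + flatten(mel_map) + list(stats)


# Toy one-second clip: a 440 Hz sine at an assumed 8 kHz sample rate.
sample_rate = 8000
clip = [math.sin(2 * math.pi * 440 * t / sample_rate) for t in range(sample_rate)]
stats = statistical_features(clip)

# Placeholder grids standing in for the two CNN branch outputs.
spec_map = [[0.0] * 4 for _ in range(3)]
mel_map = [[0.0] * 4 for _ in range(2)]

fused = merge_branches(spec_map, mel_map, stats)
print(len(stats), len(fused))  # prints "6 26"
```

In a trained model the fused vector would feed the classification layers; concatenation is used here only because it is the simplest way to merge flattened branches, and the record does not say which merge operation the authors chose.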
format Article
id doaj-art-f3159bc1bedb42e1bf37d5aaa1fbefed
institution DOAJ
issn 2169-3536
language English
publishDate 2025-01-01
publisher IEEE
record_format Article
series IEEE Access
spelling doaj-art-f3159bc1bedb42e1bf37d5aaa1fbefed
2025-08-20T03:02:14Z
eng
IEEE
IEEE Access
2169-3536
2025-01-01
Vol. 13, pp. 133827-133836
DOI 10.1109/ACCESS.2025.3593440
11098792
A Unified Approach to Voice Classification: Leveraging Spectrograms, Mel Spectrograms, and Statistical Features
Muhammad Talha; School of Electrical Engineering and Computer Science (SEECS), National University of Sciences and Technology (NUST), Islamabad, Pakistan
Huma Ghafoor; https://orcid.org/0000-0002-4640-4233; School of Electrical Engineering and Computer Science (SEECS), National University of Sciences and Technology (NUST), Islamabad, Pakistan
Seung Yeob Nam; https://orcid.org/0000-0001-8249-4742; Department of Information and Communication Engineering, Yeungnam University, Gyeongsan, Gyeongsangbuk, South Korea
title A Unified Approach to Voice Classification: Leveraging Spectrograms, Mel Spectrograms, and Statistical Features
topic Voice classification
convolutional neural network (CNN)
Mel spectrogram
spectrogram
statistical features
url https://ieeexplore.ieee.org/document/11098792/