A Unified Approach to Voice Classification: Leveraging Spectrograms, Mel Spectrograms, and Statistical Features

This study presents a multi-input neural network architecture for voice classification that integrates two parallel convolutional neural networks (CNNs) for spectrogram and Mel spectrogram images, along with a fully connected dense network for six handpicked numerical statistical features from time...

Bibliographic Details
Main Authors:	Muhammad Talha, Huma Ghafoor, Seung Yeob Nam
Format:	Article
Language:	English
Published:	IEEE 2025-01-01
Series:	IEEE Access
Subjects:	Voice classification; convolutional neural network (CNN); Mel spectrogram; spectrogram; statistical features
Online Access:	https://ieeexplore.ieee.org/document/11098792/

author	Muhammad Talha, Huma Ghafoor, Seung Yeob Nam
collection	DOAJ
description	This study presents a multi-input neural network architecture for voice classification that integrates two parallel convolutional neural networks (CNNs) for spectrogram and Mel spectrogram images, along with a fully connected dense network for six handpicked numerical statistical features extracted from the time-domain signal. The outputs of these branches are flattened and merged, enabling the model to learn complementary patterns from both visual and numerical modalities. A new dataset, voice-18, was developed, consisting of one-second audio clips from 18 speakers across 18 classes. Extensive experiments evaluated the performance of individual and combined inputs. Results demonstrate that the multi-input model achieves its highest accuracy when spectrograms, Mel spectrograms, and statistical features are used together. With all three inputs, the model attained accuracies of $0.9849 \pm 0.0093$ on voice-18, $0.8825 \pm 0.0137$ on UrbanSound8K (US8K), and $0.9220 \pm 0.0276$ on Environmental Sound Classification (ESC-50), while models trained on fewer than three inputs underperformed. These findings confirm the effectiveness of the proposed multimodal architecture for accurate voice and sound classification across different datasets.
format	Article
id	doaj-art-f3159bc1bedb42e1bf37d5aaa1fbefed
institution	DOAJ
issn	2169-3536
language	English
publishDate	2025-01-01
publisher	IEEE
record_format	Article
series	IEEE Access
spelling	doaj-art-f3159bc1bedb42e1bf37d5aaa1fbefed; 2025-08-20T03:02:14Z; eng; IEEE; IEEE Access; ISSN 2169-3536; 2025-01-01; vol. 13, pp. 133827-133836; DOI 10.1109/ACCESS.2025.3593440; document 11098792; Muhammad Talha (School of Electrical Engineering and Computer Science (SEECS), National University of Sciences and Technology (NUST), Islamabad, Pakistan); Huma Ghafoor (https://orcid.org/0000-0002-4640-4233; School of Electrical Engineering and Computer Science (SEECS), National University of Sciences and Technology (NUST), Islamabad, Pakistan); Seung Yeob Nam (https://orcid.org/0000-0001-8249-4742; Department of Information and Communication Engineering, Yeungnam University, Gyeongsan, Gyeongsangbuk, South Korea); https://ieeexplore.ieee.org/document/11098792/; Voice classification; convolutional neural network (CNN); Mel spectrogram; spectrogram; statistical features
title	A Unified Approach to Voice Classification: Leveraging Spectrograms, Mel Spectrograms, and Statistical Features
topic	Voice classification; convolutional neural network (CNN); Mel spectrogram; spectrogram; statistical features
url	https://ieeexplore.ieee.org/document/11098792/
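
Illustrative sketch (not part of the catalog record): the description says that six statistical features from the time-domain signal are merged with the flattened outputs of two CNN branches. The record does not name the six features, so the set below (mean, standard deviation, RMS, zero-crossing rate, skewness, kurtosis) is an assumption, as are the function names `stat_features`, `flatten`, and `fuse`. The CNN branches themselves are replaced by toy 2x2 feature maps; only the flatten-and-concatenate step is shown.

```python
import math

def stat_features(signal):
    """Six illustrative time-domain statistics (an assumed set, not the paper's)."""
    n = len(signal)
    mean = sum(signal) / n
    var = sum((x - mean) ** 2 for x in signal) / n
    std = math.sqrt(var)
    rms = math.sqrt(sum(x * x for x in signal) / n)
    # Fraction of adjacent sample pairs whose signs differ.
    zcr = sum(1 for a, b in zip(signal, signal[1:]) if (a < 0) != (b < 0)) / (n - 1)
    # Standardized third and fourth moments; defined as 0 for a constant signal.
    skew = sum((x - mean) ** 3 for x in signal) / (n * std ** 3) if std > 0 else 0.0
    kurt = sum((x - mean) ** 4 for x in signal) / (n * std ** 4) if std > 0 else 0.0
    return [mean, std, rms, zcr, skew, kurt]

def flatten(feature_map):
    """Flatten a 2-D feature map (list of rows) into one flat list."""
    return [v for row in feature_map for v in row]

def fuse(spec_map, mel_map, stats):
    """Late fusion: flatten each branch output and concatenate the results."""
    return flatten(spec_map) + flatten(mel_map) + list(stats)

# Toy stand-ins for the two CNN branch outputs (hypothetical values).
spec_map = [[0.1, 0.2], [0.3, 0.4]]
mel_map = [[0.5, 0.6], [0.7, 0.8]]
# Synthetic test signal: five cycles of a sine wave over 100 samples.
signal = [math.sin(2 * math.pi * 5 * t / 100) for t in range(100)]

fused = fuse(spec_map, mel_map, stat_features(signal))
print(len(fused))  # 4 + 4 + 6 = 14
```

In a full model the fused vector would typically feed further dense layers ending in one output per class; that part, and the actual branch architectures, are not specified in this record.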
