Enhancing Speaker Recognition with CRET Model: a fusion of CONV2D, RESNET and ECAPA-TDNN

Abstract: Speaker recognition plays an increasingly important role in today's society, and neural networks are now widely employed to extract speaker features. Although the Emphasized Channel Attention, Propagation, and Aggregation in Time Delay Neural Network (ECAPA-TDNN) model captures some temporal context through dilated convolution, it falls short of acquiring fully comprehensive speech features. To further improve accuracy, better capture temporal context, and make ECAPA-TDNN robust to small offsets in the frequency domain, we combine a two-dimensional convolutional network (Conv2D), a residual network (ResNet), and ECAPA-TDNN into a novel CRET model. Two CRET variants are proposed and compared against the Res2Net (multi-scale backbone) and ECAPA-TDNN baselines across different channel widths and datasets. The experimental findings indicate that the proposed models perform strongly on both training and test sets, even when the network is deep. The best configuration, trained on VoxCeleb2 with 1024 channels, achieves an accuracy of 0.97828, an equal error rate (EER) of 0.03612 on the VoxCeleb1-O test set, and a minimum detection cost function (MinDCF) of 0.43967. This technology can improve public safety and service efficiency in smart city construction, benefit fields such as finance and education, and bring more convenience to people's lives.
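
The record does not include the paper's architecture details, but the general idea the abstract describes (a Conv2D front-end and ResNet blocks extracting shift-robust time-frequency features, feeding an ECAPA-TDNN-style back-end) can be sketched roughly as follows. This is a minimal PyTorch illustration, not the authors' implementation: CRETSketch, ResBlock2D, and every layer size, block count, and the simplified pooling head are assumptions.

# Minimal sketch of the CRET idea: a Conv2D stem plus 2-D residual blocks
# extract time-frequency features, which are flattened into channel-by-time
# features for a dilated 1-D TDNN back-end. All dimensions are illustrative.
import torch
import torch.nn as nn


class ResBlock2D(nn.Module):
    """Basic 2-D residual block (assumed; the paper may use another variant)."""
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
        )
        self.act = nn.ReLU()

    def forward(self, x):
        return self.act(x + self.body(x))


class CRETSketch(nn.Module):
    def __init__(self, n_mels=80, stem_channels=32, tdnn_channels=1024, emb_dim=192):
        super().__init__()
        # Conv2D stem: treats the log-mel spectrogram as an image, which gives
        # some invariance to small offsets along the frequency axis.
        self.stem = nn.Sequential(
            nn.Conv2d(1, stem_channels, 3, padding=1),
            nn.BatchNorm2d(stem_channels),
            nn.ReLU(),
            ResBlock2D(stem_channels),
            ResBlock2D(stem_channels),
        )
        # Dilated 1-D convolutions standing in for the ECAPA-TDNN back-end
        # (the real back-end uses SE-Res2Blocks and multi-layer aggregation).
        in_1d = stem_channels * n_mels
        self.tdnn = nn.Sequential(
            nn.Conv1d(in_1d, tdnn_channels, 5, padding=2),
            nn.ReLU(),
            nn.Conv1d(tdnn_channels, tdnn_channels, 3, dilation=2, padding=2),
            nn.ReLU(),
            nn.Conv1d(tdnn_channels, tdnn_channels, 3, dilation=3, padding=3),
            nn.ReLU(),
        )
        # Mean + std pooling over time (ECAPA-TDNN uses attentive stat pooling).
        self.embed = nn.Linear(2 * tdnn_channels, emb_dim)

    def forward(self, mel):                  # mel: (batch, n_mels, frames)
        x = self.stem(mel.unsqueeze(1))      # (batch, C, n_mels, frames)
        x = x.flatten(1, 2)                  # (batch, C * n_mels, frames)
        x = self.tdnn(x)
        stats = torch.cat([x.mean(-1), x.std(-1)], dim=-1)
        return self.embed(stats)             # fixed-size speaker embedding


model = CRETSketch()
emb = model(torch.randn(2, 80, 200))         # two utterances, 200 frames each
print(emb.shape)                             # torch.Size([2, 192])

The tdnn_channels=1024 default mirrors the 1024-channel configuration the abstract reports as best; everything else is a placeholder for the paper's actual design.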
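The abstract reports results as EER and MinDCF. As background for reading those numbers: the EER is the operating point at which the false-acceptance rate equals the false-rejection rate over a list of verification trials. The sketch below computes it with plain NumPy; it is a generic illustration, not the paper's evaluation code, and compute_eer and the toy trial list are hypothetical.

# Generic EER computation over speaker-verification trial scores.
import numpy as np


def compute_eer(scores, labels):
    """scores: similarity per trial; labels: 1 = same speaker, 0 = different."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    # Sweep every score as a threshold, from most to least permissive.
    order = np.argsort(scores)[::-1]
    sorted_labels = labels[order]
    n_target = labels.sum()
    n_nontarget = len(labels) - n_target
    # After accepting the top-k trials: false accepts are non-targets accepted,
    # false rejects are targets not yet accepted.
    false_accepts = np.cumsum(1 - sorted_labels)
    false_rejects = n_target - np.cumsum(sorted_labels)
    far = false_accepts / n_nontarget
    frr = false_rejects / n_target
    # EER sits where FAR and FRR cross; take the midpoint at the closest point.
    idx = np.argmin(np.abs(far - frr))
    return (far[idx] + frr[idx]) / 2


# Toy trial list: cosine scores for target (1) and non-target (0) pairs.
scores = [0.92, 0.85, 0.40, 0.78, 0.31, 0.66, 0.12, 0.55]
labels = [1, 1, 0, 1, 0, 0, 0, 1]
print(f"EER = {compute_eer(scores, labels):.4f}")   # EER = 0.2500 on this toy data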

Bibliographic Details
Main Authors: Pinyan Li, Lap Man Hoi, Yapeng Wang, Xu Yang (Faculty of Applied Sciences, Macao Polytechnic University); Sio Kei Im (Macao Polytechnic University)
Format: Article
Language: English
Published: SpringerOpen, 2025-02-01
Series: EURASIP Journal on Audio, Speech, and Music Processing
ISSN: 1687-4722
Collection: DOAJ
Subjects: Speaker recognition, Conv2D, ResNet, ECAPA-TDNN, Smart city
Online Access: https://doi.org/10.1186/s13636-025-00396-4