Voice Spoofing Detection Through Residual Network, Max Feature Map, and Depthwise Separable Convolution

Bibliographic Details
Main Authors: Il-Youp Kwak, Sungsu Kwag, Junhee Lee, Youngbae Jeon, Jeonghwan Hwang, Hyo-Jung Choi, Jong-Hoon Yang, So-Yul Han, Jun Ho Huh, Choong-Hoon Lee, Ji Won Yoon
Format: Article
Language:English
Published: IEEE 2023-01-01
Series:IEEE Access
Subjects:
Online Access:https://ieeexplore.ieee.org/document/10123935/
_version_ 1849731098499612672
author Il-Youp Kwak
Sungsu Kwag
Junhee Lee
Youngbae Jeon
Jeonghwan Hwang
Hyo-Jung Choi
Jong-Hoon Yang
So-Yul Han
Jun Ho Huh
Choong-Hoon Lee
Ji Won Yoon
author_facet Il-Youp Kwak
Sungsu Kwag
Junhee Lee
Youngbae Jeon
Jeonghwan Hwang
Hyo-Jung Choi
Jong-Hoon Yang
So-Yul Han
Jun Ho Huh
Choong-Hoon Lee
Ji Won Yoon
author_sort Il-Youp Kwak
collection DOAJ
description The goal of the “2019 Automatic Speaker Verification Spoofing and Countermeasures Challenge” (ASVspoof) was to foster the development of systems that can detect voice spoofing attacks with high accuracy. However, the competition did not emphasize model complexity or latency, even though both are stringent requirements for real-world deployment. Most of the top-performing solutions used an ensemble technique that merged numerous sophisticated deep learning models to maximize detection accuracy; such approaches are ill-suited to real-world deployment on voice assistants, which have limited computational resources. We combined skip connections (from ResNet) and the max feature map (from Light CNN) to create a compact system, and evaluated its performance on the ASVspoof 2019 dataset. Using an optimized constant Q transform (CQT) feature, our single model achieved a replay attack detection equal error rate (EER) of 0.30% on the evaluation set, outperforming the top ensemble system in the competition, which scored an EER of 0.39%. To reduce model size, we experimented with depthwise separable convolutions (from MobileNet), which cut the parameter count by 84.3% (from 286K to 45K) while maintaining similar performance (EER of 0.36%). Additionally, we used Grad-CAM to identify which spectrogram regions contribute most to the detection of spoofed audio.
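The abstract names two parameter-saving building blocks: the max feature map (MFM) activation, which halves the channel count by taking an element-wise max over channel pairs, and depthwise separable convolution, which factors a standard convolution into a per-channel spatial filter plus a 1x1 pointwise mix. A minimal pure-Python sketch of both ideas (illustrative only: the function names, the flat-list treatment of channels, and the 64-channel 3x3 example layer are assumptions, not the authors' implementation):

```python
def mfm(x):
    """Max Feature Map: split the channel list in half and take the
    element-wise max of the two halves, halving the channel count."""
    assert len(x) % 2 == 0, "channel count must be even"
    half = len(x) // 2
    return [max(a, b) for a, b in zip(x[:half], x[half:])]

def residual(x, block):
    """Skip connection: add the block's input back onto its output."""
    y = block(x)
    return [a + b for a, b in zip(x, y)]

def conv_params(c_in, c_out, k):
    """Parameter count of a standard k x k convolution (no bias)."""
    return c_in * c_out * k * k

def depthwise_separable_params(c_in, c_out, k):
    """Depthwise k x k convolution (one filter per input channel)
    followed by a 1 x 1 pointwise convolution (no bias)."""
    return c_in * k * k + c_in * c_out

# Example: a 64-in / 64-out 3x3 layer.
std = conv_params(64, 64, 3)                  # 64 * 64 * 9 = 36864
sep = depthwise_separable_params(64, 64, 3)   # 64 * 9 + 64 * 64 = 4672
print(f"standard: {std}, separable: {sep}, saved: {1 - sep / std:.1%}")
```

For this single layer the factored form uses roughly 87% fewer parameters; the 84.3% whole-model reduction reported in the abstract (286K to 45K) is of the same order, since not every layer factors this way.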
format Article
id doaj-art-2a963012ecda48229db2687f40edbff2
institution DOAJ
issn 2169-3536
language English
publishDate 2023-01-01
publisher IEEE
record_format Article
series IEEE Access
spelling doaj-art-2a963012ecda48229db2687f40edbff22025-08-20T03:08:40ZengIEEEIEEE Access2169-35362023-01-0111491404915210.1109/ACCESS.2023.327579010123935Voice Spoofing Detection Through Residual Network, Max Feature Map, and Depthwise Separable ConvolutionIl-Youp Kwak0https://orcid.org/0000-0002-7117-7669Sungsu Kwag1https://orcid.org/0009-0007-6912-6452Junhee Lee2https://orcid.org/0009-0000-4292-1973Youngbae Jeon3Jeonghwan Hwang4Hyo-Jung Choi5Jong-Hoon Yang6So-Yul Han7Jun Ho Huh8https://orcid.org/0000-0003-2007-4018Choong-Hoon Lee9https://orcid.org/0000-0001-5146-0259Ji Won Yoon10https://orcid.org/0000-0003-2123-9849Department of Applied Statistics, Chung-Ang University, Seoul, South KoreaSamsung Research, Seoul, South KoreaSamsung Research, Seoul, South KoreaSamsung Research, Seoul, South KoreaSchool of Cybersecurity, Korea University, Seoul, South KoreaDepartment of Applied Statistics, Chung-Ang University, Seoul, South KoreaDepartment of Applied Statistics, Chung-Ang University, Seoul, South KoreaDepartment of Applied Statistics, Chung-Ang University, Seoul, South KoreaSamsung Research, Seoul, South KoreaSamsung Research, Seoul, South KoreaSchool of Cybersecurity, Korea University, Seoul, South KoreaThe goal of the “2019 Automatic Speaker Verification Spoofing and Countermeasures Challenge” (ASVspoof) was to make it easier to create systems that could identify voice spoofing attacks with high levels of accuracy. However, model complexity and latency requirements were not emphasized in the competition, despite the fact that they are stringent requirements for implementation in the real world. The majority of the top-performing solutions from the competition used an ensemble technique that merged numerous sophisticated deep learning models to maximize detection accuracy. Those approaches struggle with real-world deployment restrictions for voice assistants which would have restricted resources. 
We merged skip connection (from ResNet) and max feature map (from Light CNN) to create a compact system, and we tested its performance using the ASVspoof 2019 dataset. Our single model achieved a replay attack detection equal error rate (EER) of 0.30% on the evaluation set using an optimized constant Q transform (CQT) feature, outperforming the top ensemble system in the competition, which scored an EER of 0.39%. We experimented using depthwise separable convolutions (from MobileNet) to reduce model sizes; this resulted in an 84.3 percent reduction in parameter count (from 286K to 45K), while maintaining similar performance (EER of 0.36%). Additionally, we used Grad-CAM to clarify which spectrogram regions significantly contribute to the detection of fake data.https://ieeexplore.ieee.org/document/10123935/Voice assistant securityvoice spoofing attackvoice synthesis attackvoice presentation attack detection
spellingShingle Il-Youp Kwak
Sungsu Kwag
Junhee Lee
Youngbae Jeon
Jeonghwan Hwang
Hyo-Jung Choi
Jong-Hoon Yang
So-Yul Han
Jun Ho Huh
Choong-Hoon Lee
Ji Won Yoon
Voice Spoofing Detection Through Residual Network, Max Feature Map, and Depthwise Separable Convolution
IEEE Access
Voice assistant security
voice spoofing attack
voice synthesis attack
voice presentation attack detection
title Voice Spoofing Detection Through Residual Network, Max Feature Map, and Depthwise Separable Convolution
title_full Voice Spoofing Detection Through Residual Network, Max Feature Map, and Depthwise Separable Convolution
title_fullStr Voice Spoofing Detection Through Residual Network, Max Feature Map, and Depthwise Separable Convolution
title_full_unstemmed Voice Spoofing Detection Through Residual Network, Max Feature Map, and Depthwise Separable Convolution
title_short Voice Spoofing Detection Through Residual Network, Max Feature Map, and Depthwise Separable Convolution
title_sort voice spoofing detection through residual network max feature map and depthwise separable convolution
topic Voice assistant security
voice spoofing attack
voice synthesis attack
voice presentation attack detection
url https://ieeexplore.ieee.org/document/10123935/
work_keys_str_mv AT ilyoupkwak voicespoofingdetectionthroughresidualnetworkmaxfeaturemapanddepthwiseseparableconvolution
AT sungsukwag voicespoofingdetectionthroughresidualnetworkmaxfeaturemapanddepthwiseseparableconvolution
AT junheelee voicespoofingdetectionthroughresidualnetworkmaxfeaturemapanddepthwiseseparableconvolution
AT youngbaejeon voicespoofingdetectionthroughresidualnetworkmaxfeaturemapanddepthwiseseparableconvolution
AT jeonghwanhwang voicespoofingdetectionthroughresidualnetworkmaxfeaturemapanddepthwiseseparableconvolution
AT hyojungchoi voicespoofingdetectionthroughresidualnetworkmaxfeaturemapanddepthwiseseparableconvolution
AT jonghoonyang voicespoofingdetectionthroughresidualnetworkmaxfeaturemapanddepthwiseseparableconvolution
AT soyulhan voicespoofingdetectionthroughresidualnetworkmaxfeaturemapanddepthwiseseparableconvolution
AT junhohuh voicespoofingdetectionthroughresidualnetworkmaxfeaturemapanddepthwiseseparableconvolution
AT choonghoonlee voicespoofingdetectionthroughresidualnetworkmaxfeaturemapanddepthwiseseparableconvolution
AT jiwonyoon voicespoofingdetectionthroughresidualnetworkmaxfeaturemapanddepthwiseseparableconvolution