Voice Spoofing Detection Through Residual Network, Max Feature Map, and Depthwise Separable Convolution
The goal of the “2019 Automatic Speaker Verification Spoofing and Countermeasures Challenge” (ASVspoof) was to make it easier to create systems that could identify voice spoofing attacks with high levels of accuracy. However, model complexity and latency requirements were not e...
Saved in:
| Main Authors: | Il-Youp Kwak, Sungsu Kwag, Junhee Lee, Youngbae Jeon, Jeonghwan Hwang, Hyo-Jung Choi, Jong-Hoon Yang, So-Yul Han, Jun Ho Huh, Choong-Hoon Lee, Ji Won Yoon |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | IEEE, 2023-01-01 |
| Series: | IEEE Access |
| Subjects: | Voice assistant security; voice spoofing attack; voice synthesis attack; voice presentation attack detection |
| Online Access: | https://ieeexplore.ieee.org/document/10123935/ |
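The record's abstract describes merging skip connections (from ResNet) with the max feature map (MFM) activation (from Light CNN) into one compact block. As a minimal illustrative sketch in plain Python, not the authors' implementation: MFM splits the channel dimension in half and takes an element-wise maximum, halving the channel count, which keeps shapes compatible with a residual addition.

```python
# Illustrative sketch (not the paper's code): MFM activation plus a
# ResNet-style skip connection, on flat lists standing in for channels.

def max_feature_map(channels):
    """MFM over 2*k per-channel activations -> k channels (element-wise max)."""
    assert len(channels) % 2 == 0, "MFM needs an even channel count"
    half = len(channels) // 2
    return [max(a, b) for a, b in zip(channels[:half], channels[half:])]

def residual_mfm_block(x, transform):
    """Skip connection around an MFM-activated transform.

    `transform` maps k channels to 2*k channels (e.g. a convolution);
    MFM brings it back to k so the skip addition lines up.
    """
    y = max_feature_map(transform(x))
    return [xi + yi for xi, yi in zip(x, y)]

# Toy usage: a hypothetical "transform" that duplicates and scales channels.
x = [1.0, -2.0, 3.0]
out = residual_mfm_block(x, lambda c: [2 * v for v in c] + [-v for v in c])
```

The design point the abstract hints at: MFM halves activations (a built-in feature-selection step), while the skip connection preserves gradient flow, so the two combine into a small network rather than an ensemble.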
| Field | Value |
|---|---|
| Title | Voice Spoofing Detection Through Residual Network, Max Feature Map, and Depthwise Separable Convolution |
| Authors | Il-Youp Kwak (Department of Applied Statistics, Chung-Ang University, Seoul, South Korea; ORCID 0000-0002-7117-7669); Sungsu Kwag (Samsung Research, Seoul, South Korea; ORCID 0009-0007-6912-6452); Junhee Lee (Samsung Research; ORCID 0009-0000-4292-1973); Youngbae Jeon (Samsung Research); Jeonghwan Hwang (School of Cybersecurity, Korea University, Seoul, South Korea); Hyo-Jung Choi (Chung-Ang University); Jong-Hoon Yang (Chung-Ang University); So-Yul Han (Chung-Ang University); Jun Ho Huh (Samsung Research; ORCID 0000-0003-2007-4018); Choong-Hoon Lee (Samsung Research; ORCID 0000-0001-5146-0259); Ji Won Yoon (Korea University; ORCID 0000-0003-2123-9849) |
| Abstract | The "2019 Automatic Speaker Verification Spoofing and Countermeasures Challenge" (ASVspoof 2019) aimed to foster systems that detect voice spoofing attacks with high accuracy. However, the competition did not emphasize model complexity or latency, even though both are stringent requirements for real-world deployment. Most top-performing solutions combined numerous sophisticated deep learning models into ensembles to maximize detection accuracy, an approach poorly suited to voice assistants, which have limited computational resources. We merged skip connections (from ResNet) with the max feature map (from Light CNN) to create a compact system, and we evaluated it on the ASVspoof 2019 dataset. Our single model achieved a replay attack detection equal error rate (EER) of 0.30% on the evaluation set using an optimized constant Q transform (CQT) feature, outperforming the competition's top ensemble system (EER 0.39%). To reduce model size, we experimented with depthwise separable convolutions (from MobileNet), cutting the parameter count by 84.3% (from 286K to 45K) while maintaining similar performance (EER 0.36%). Additionally, we used Grad-CAM to show which spectrogram regions contribute most to the detection of fake audio. |
| Journal | IEEE Access, vol. 11, pp. 49140-49152, 2023 |
| Publisher | IEEE |
| Published | 2023-01-01 |
| Format | Article |
| Language | English |
| ISSN | 2169-3536 |
| DOI | 10.1109/ACCESS.2023.3275790 |
| Keywords | Voice assistant security; voice spoofing attack; voice synthesis attack; voice presentation attack detection |
| Collection | DOAJ |
| Record ID | doaj-art-2a963012ecda48229db2687f40edbff2 |
| Online Access | https://ieeexplore.ieee.org/document/10123935/ |
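The 84.3% parameter reduction reported in the abstract comes from replacing standard convolutions with depthwise separable ones, which factorize a convolution into a per-channel (depthwise) filter plus a 1x1 (pointwise) channel mixer. A back-of-envelope sketch, using example layer sizes of my own choosing rather than the paper's actual architecture, shows why the factorization shrinks parameter counts:

```python
# Back-of-envelope arithmetic (illustrative layer sizes, not the paper's):
# parameter count of a standard conv vs. its depthwise separable factorization.

def standard_conv_params(c_in, c_out, k):
    # One k x k filter per (input channel, output channel) pair.
    return c_in * c_out * k * k

def depthwise_separable_params(c_in, c_out, k):
    depthwise = c_in * k * k   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1x1 conv mixing channels into c_out outputs
    return depthwise + pointwise

c_in, c_out, k = 64, 64, 3
std = standard_conv_params(c_in, c_out, k)        # 64*64*9 = 36864
sep = depthwise_separable_params(c_in, c_out, k)  # 576 + 4096 = 4672
reduction = 1 - sep / std                         # roughly 0.87
```

For a 3x3 convolution with 64 input and 64 output channels, the factorized form needs roughly an eighth of the parameters, the same ballpark as the 286K-to-45K model-wide reduction the abstract reports.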