Audio Deepfake Detection Using Deep Learning

Bibliographic Details
Main Authors: Ousama A. Shaaban, Remzi Yildirim
Format: Article
Language: English
Published: Wiley 2025-03-01
Series: Engineering Reports
Subjects:
Online Access: https://doi.org/10.1002/eng2.70087
collection DOAJ
description ABSTRACT This study introduces an enhanced Siamese convolutional neural network (Siamese CNN) architecture with a novel StacLoss function and self‐attention modules for efficient identification of audio deepfakes. The model directly compares unprocessed original audio with modified audio: dual convolutional branches first extract complex characteristics from the raw audio signals, and residual connections then enhance the network's performance. The self‐attention modules are trained layer by layer alongside these fundamental layers to capture multi‐headed attention within audio frames. StacLoss is a customized version of the contrastive loss function. It helps the network distinguish between original and modified audio by minimizing the loss between pairs of original audio that share the same identity while maximizing the distance between manipulated audio samples, and it improves feature extraction compared with standard techniques. The method was verified on a range of audio modifications, and its resilience was assessed on the ASVspoof2019 dataset through comprehensive testing across its audio manipulation scenarios. The proposed Siamese CNN outperformed both machine learning and deep learning models, achieving an accuracy of 98%, precision of 97%, recall of 96%, F1 score of 96.5%, ROC‐AUC of 99%, and an equal error rate (EER) of 2.95%.
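The abstract describes StacLoss only as a customized version of the contrastive loss and gives no formula. As a point of reference, the sketch below shows the standard pairwise contrastive loss, not StacLoss itself. The function names, the `margin` parameter, the omission of the conventional 1/2 scaling factor, and the toy embedding vectors are all assumptions made for illustration; in a Siamese network the two embeddings would come from the two weight-sharing branches.

```python
import math
from typing import Sequence


def euclidean_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def contrastive_loss(
    emb_a: Sequence[float],
    emb_b: Sequence[float],
    same: bool,
    margin: float = 1.0,
) -> float:
    """Standard contrastive loss for one pair (NOT the paper's StacLoss).

    same=True:  the pair should be close, so the squared distance is penalized.
    same=False: the pair should be at least `margin` apart, so only the
                squared shortfall below the margin is penalized.
    """
    d = euclidean_distance(emb_a, emb_b)
    if same:
        return d ** 2
    return max(0.0, margin - d) ** 2


# Hypothetical 3-dimensional embeddings, chosen only to exercise both cases.
genuine_1 = [0.10, 0.20, 0.30]
genuine_2 = [0.10, 0.20, 0.40]
spoofed = [0.90, 0.80, 0.10]

loss_same = contrastive_loss(genuine_1, genuine_2, same=True)
loss_diff = contrastive_loss(genuine_1, spoofed, same=False)
```

In this toy case the genuine pair incurs a small loss because the embeddings are close, while the genuine/spoofed pair is already farther apart than the margin and contributes no loss. How StacLoss departs from this baseline is not stated in the record.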
format Article
id doaj-art-8b87628ec7084a5baf22d959c140c71a
institution DOAJ
issn 2577-8196
Author affiliations: Ousama A. Shaaban, Graduate School of Natural and Applied Sciences, Ankara Yıldırım Beyazıt University, Ankara, Turkey; Remzi Yildirim, Department of Computer Engineering, Tokat Gaziosmanpaşa University, Tokat, Turkey
Volume/Issue: 7/3
topic audio deepfake
deep learning
deepfake
machine learning
Siamese CNN