LSTM autoencoder based parallel architecture for deepfake audio detection with dynamic residual encoding and feature fusion
| Main Authors: | , , , |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | Nature Portfolio, 2025-07-01 |
| Series: | Scientific Reports |
| Subjects: | |
| Online Access: | https://doi.org/10.1038/s41598-025-08198-6 |
| Summary: | With the rapid advancement of synthetic speech technologies, detecting deepfake audio has become essential for preventing impersonation and misinformation. This study aims to enhance detection performance by addressing limitations in existing models, such as temporal inconsistencies, weak contextual representation, and reconstruction loss. A novel framework, termed Long Short-Term Memory Auto-Encoder with Dynamic Residual Difference Encoding (LSTM-AE-DRDE), is proposed to overcome these challenges. The framework consists of two parallel modules: one leverages an attention-enhanced LSTM with contrastive learning to highlight critical temporal cues, while the other amplifies real-vs-fake separability by computing residual differences across transformed audio variants. By integrating diverse speech features, including MFCC, temporal, prosodic, wavelet packet, and glottal parameters, the model captures both low- and high-level audio characteristics. Experimental evaluation was carried out on five benchmark datasets (CVoice Fake, FoR, Deepfake Voice Recognition, ODSS, and CMFD), where the proposed model achieved classification accuracies of 97%, 90%, 96%, 97%, and 95%, respectively. Furthermore, when compared with eleven state-of-the-art methods, the proposed model demonstrated superior performance, with an overall ROC-AUC of approximately 98%. In addition, a comprehensive feature-wise ablation study was conducted to assess the contribution of each feature set, confirming the robustness and reliability of the proposed framework. |
| ISSN: | 2045-2322 |
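
The summary describes one module as computing residual differences across transformed audio variants and fusing them with other features. This record does not specify the transforms, the feature extractors, or the encoding, so the following is only a minimal sketch of the general idea under stated assumptions: the function name `residual_difference_features`, the toy feature extractor, and the two toy transforms are all hypothetical and are not taken from the article.

```python
from typing import Callable, List, Sequence

Vector = List[float]


def residual_difference_features(
    signal: Sequence[float],
    extract: Callable[[Sequence[float]], Vector],
    transforms: Sequence[Callable[[Sequence[float]], Sequence[float]]],
) -> Vector:
    """Concatenate base features with per-transform residuals (hypothetical sketch)."""
    base = extract(signal)
    fused = list(base)
    for transform in transforms:
        variant = extract(transform(signal))
        # Residual: how far each feature moves when the audio is transformed.
        fused.extend(v - b for v, b in zip(variant, base))
    return fused


# Toy stand-ins for illustration only; NOT the article's features or transforms.
def toy_features(x: Sequence[float]) -> Vector:
    mean = sum(x) / len(x)
    energy = sum(s * s for s in x) / len(x)
    return [mean, energy]


def attenuate(x: Sequence[float]) -> List[float]:
    return [0.5 * s for s in x]


def reverse(x: Sequence[float]) -> List[float]:
    return list(reversed(x))


features = residual_difference_features(
    [1.0, -1.0, 2.0, 0.0], toy_features, [attenuate, reverse]
)
print(features)  # base features followed by one residual pair per transform
```

In this toy setup the fused vector has the two base features followed by two residuals per transform. Time reversal leaves mean and energy unchanged, so its residuals are zero, while attenuation shifts both. A real system would presumably feed such a fused vector to a classifier alongside the LSTM branch, but that wiring is not described in this record.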