BDEKD: mitigating backdoor attacks in NLP models via ensemble knowledge distillation

Abstract Backdoor attacks present significant risks to the security of deep neural networks (DNNs) in the NLP domain, as attackers can covertly manipulate a model’s output behavior either by poisoning the training data or by tampering with the model’s training process. This paper introduces a novel backdoor d...

Bibliographic Details
Main Authors: Zijie Zhang, Xinyuan Miao, Chenyu Zhou, Chenming Shang, Xi Chen, Xianglong Kong, Wei Huang, Yi Cao
Format: Article
Language: English
Published: Springer 2025-07-01
Series: Complex & Intelligent Systems
Subjects:
Online Access: https://doi.org/10.1007/s40747-025-02006-4
_version_ 1849388809733537792
author Zijie Zhang
Xinyuan Miao
Chenyu Zhou
Chenming Shang
Xi Chen
Xianglong Kong
Wei Huang
Yi Cao
author_facet Zijie Zhang
Xinyuan Miao
Chenyu Zhou
Chenming Shang
Xi Chen
Xianglong Kong
Wei Huang
Yi Cao
author_sort Zijie Zhang
collection DOAJ
description Abstract Backdoor attacks present significant risks to the security of deep neural networks (DNNs) in the NLP domain, as attackers can covertly manipulate a model’s output behavior either by poisoning the training data or by tampering with the model’s training process. This paper introduces a novel backdoor defense strategy, Backdoor Defense via Ensemble Knowledge Distillation (BDEKD), to mitigate various types of backdoors in compromised DNNs. To our knowledge, it is the first use of ensemble methods to enhance backdoor mitigation. The BDEKD framework requires only a minimal subset of clean data to sanitize the compromised model, generating several relatively heterogeneous, backdoor-cleaned teacher models. The training data are then enhanced through augmentation, and an ensemble distillation technique specifically designed to remove the backdoor from the model is applied. Our empirical analysis demonstrates that BDEKD lowers the success rate of six sophisticated backdoor attacks to approximately 17% while requiring only 20% of the training data. Crucially, it preserves the model’s accuracy on clean data at around 85%, ensuring minimal impact on its intended functionality. Our code is available at https://github.com/quanzhuangdefujinan/BDEKD-Research/tree/BDEKD . Graphical abstract
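The distillation step the abstract describes (several backdoor-cleaned teacher models whose predictions jointly supervise the student) can be illustrated with a generic ensemble-distillation loss: average the teachers' temperature-softened class distributions and penalize the student's KL divergence from that ensemble target. This is a minimal sketch of the general technique, not the authors' BDEKD implementation; the function names, temperature, and logit values below are all hypothetical.

```python
import math

def softmax(logits, T=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / T for z in logits]
    m = max(scaled)                       # subtract max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def ensemble_distill_loss(student_logits, teacher_logits_list, T=2.0):
    """KL(ensemble_target || student) for one example.

    The target is the mean of the teachers' temperature-softened
    distributions; training the student to match it transfers the
    teachers' (clean) behavior rather than any single model's quirks.
    """
    teacher_dists = [softmax(t, T) for t in teacher_logits_list]
    k = len(teacher_dists[0])
    ensemble = [sum(d[i] for d in teacher_dists) / len(teacher_dists)
                for i in range(k)]
    student = softmax(student_logits, T)
    # small epsilon guards against log(0)
    return sum(p * (math.log(p + 1e-12) - math.log(q + 1e-12))
               for p, q in zip(ensemble, student))

# Three hypothetical "cleaned teacher" logit vectors for a 3-class task.
t1, t2, t3 = [2.0, 0.5, -1.0], [1.5, 1.0, -0.5], [2.2, 0.0, -1.2]
loss = ensemble_distill_loss([0.1, 0.1, 0.1], [t1, t2, t3])
print(loss)  # positive: the near-uniform student has not yet matched the ensemble
```

Averaging the teachers' soft distributions (rather than, say, hard-voting their labels) keeps the dark-knowledge signal that distillation relies on, and heterogeneous teachers make it unlikely that all of them share the same residual backdoor behavior.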
format Article
id doaj-art-87f0d5aef0d64402b5a78ed866c9cb2b
institution Kabale University
issn 2199-4536
2198-6053
language English
publishDate 2025-07-01
publisher Springer
record_format Article
series Complex & Intelligent Systems
spelling doaj-art-87f0d5aef0d64402b5a78ed866c9cb2b2025-08-20T03:42:10ZengSpringerComplex & Intelligent Systems2199-45362198-60532025-07-0111911710.1007/s40747-025-02006-4BDEKD: mitigating backdoor attacks in NLP models via ensemble knowledge distillationZijie Zhang0Xinyuan Miao1Chenyu Zhou2Chenming Shang3Xi Chen4Xianglong Kong5Wei Huang6Yi Cao7Southeast UniversityPurple Mountain LaboratoriesSoutheast UniversityTsinghua UniversityPurple Mountain LaboratoriesPurple Mountain LaboratoriesPurple Mountain LaboratoriesPurple Mountain LaboratoriesAbstract Backdoor attacks present significant risks to the security of deep neural networks (DNNs) in the NLP domain, as attackers can covertly manipulate a model’s output behavior either by poisoning the training data or by tampering with the model’s training process. This paper introduces a novel backdoor defense strategy, Backdoor Defense via Ensemble Knowledge Distillation (BDEKD), to mitigate various types of backdoors in compromised DNNs. To our knowledge, it is the first use of ensemble methods to enhance backdoor mitigation. The BDEKD framework requires only a minimal subset of clean data to sanitize the compromised model, generating several relatively heterogeneous, backdoor-cleaned teacher models. The training data are then enhanced through augmentation, and an ensemble distillation technique specifically designed to remove the backdoor from the model is applied. Our empirical analysis demonstrates that BDEKD lowers the success rate of six sophisticated backdoor attacks to approximately 17% while requiring only 20% of the training data. Crucially, it preserves the model’s accuracy on clean data at around 85%, ensuring minimal impact on its intended functionality. Our code is available at https://github.com/quanzhuangdefujinan/BDEKD-Research/tree/BDEKD .
Graphical abstract https://doi.org/10.1007/s40747-025-02006-4 Backdoor defense Natural language processing Knowledge distillation Ensemble learning
spellingShingle Zijie Zhang
Xinyuan Miao
Chenyu Zhou
Chenming Shang
Xi Chen
Xianglong Kong
Wei Huang
Yi Cao
BDEKD: mitigating backdoor attacks in NLP models via ensemble knowledge distillation
Complex & Intelligent Systems
Backdoor defense
Natural language processing
Knowledge distillation
Ensemble learning
title BDEKD: mitigating backdoor attacks in NLP models via ensemble knowledge distillation
title_full BDEKD: mitigating backdoor attacks in NLP models via ensemble knowledge distillation
title_fullStr BDEKD: mitigating backdoor attacks in NLP models via ensemble knowledge distillation
title_full_unstemmed BDEKD: mitigating backdoor attacks in NLP models via ensemble knowledge distillation
title_short BDEKD: mitigating backdoor attacks in NLP models via ensemble knowledge distillation
title_sort bdekd mitigating backdoor attacks in nlp models via ensemble knowledge distillation
topic Backdoor defense
Natural language processing
Knowledge distillation
Ensemble learning
url https://doi.org/10.1007/s40747-025-02006-4
work_keys_str_mv AT zijiezhang bdekdmitigatingbackdoorattacksinnlpmodelsviaensembleknowledgedistillation
AT xinyuanmiao bdekdmitigatingbackdoorattacksinnlpmodelsviaensembleknowledgedistillation
AT chenyuzhou bdekdmitigatingbackdoorattacksinnlpmodelsviaensembleknowledgedistillation
AT chenmingshang bdekdmitigatingbackdoorattacksinnlpmodelsviaensembleknowledgedistillation
AT xichen bdekdmitigatingbackdoorattacksinnlpmodelsviaensembleknowledgedistillation
AT xianglongkong bdekdmitigatingbackdoorattacksinnlpmodelsviaensembleknowledgedistillation
AT weihuang bdekdmitigatingbackdoorattacksinnlpmodelsviaensembleknowledgedistillation
AT yicao bdekdmitigatingbackdoorattacksinnlpmodelsviaensembleknowledgedistillation