Leveraging logit uncertainty for better knowledge distillation

Abstract: Knowledge distillation improves student model performance. However, using a larger teacher model does not necessarily yield better distillation gains, because of significant architecture and output gaps with smaller student networks. To address this issue, we reconsider teacher outputs and find that categories in which the teacher is strongly confident benefit distillation more, while those with weaker certainty contribute less. We therefore propose Logits Uncertainty Distillation (LUD) to bridge this gap. We introduce category uncertainty weighting to account for the uncertainty in the teacher model's predictions: a confidence threshold based on the teacher's predictions is used to construct a mask that discounts uncertain classes during distillation. Furthermore, we incorporate two Spearman correlation loss functions to align the logits of the teacher and student models; these losses measure the discrepancy between the models' outputs at the category and sample levels. We also introduce adaptive dynamic temperature factors to optimize the distillation process. Combining these techniques enhances knowledge distillation and enables effective knowledge transfer between teacher and student models even when their architectures differ. Extensive experiments on multiple datasets demonstrate the effectiveness of our method.
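The abstract's central mechanism (a teacher-confidence mask over classes, combined with temperature-scaled soft targets) can be illustrated with a minimal sketch. Everything below is an assumption for illustration only: the function names, the threshold `tau=0.1`, and the temperature `T=4.0` are not taken from the paper.

```python
import math

def softmax(logits, T=1.0):
    # Temperature-scaled softmax; higher T gives softer probabilities.
    m = max(logits)
    exps = [math.exp((z - m) / T) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def masked_kd_loss(teacher_logits, student_logits, tau=0.1, T=4.0):
    """Illustrative sketch: KL(teacher || student) over soft targets,
    keeping only classes where the teacher's probability reaches the
    confidence threshold tau (hypothetical formulation, not the paper's)."""
    p_t = softmax(teacher_logits, T)
    p_s = softmax(student_logits, T)
    eps = 1e-12
    loss = 0.0
    for pt, ps in zip(p_t, p_s):
        if pt >= tau:  # discount classes the teacher is uncertain about
            loss += pt * (math.log(pt + eps) - math.log(ps + eps))
    return T * T * loss  # T^2 rescaling, as in standard soft-target distillation
```

In a real training loop this would be computed with an autodiff framework over batches; the list-based version above only illustrates the masking arithmetic.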

Bibliographic Details
Main Authors: Zhen Guo, Dong Wang, Qiang He, Pengzhou Zhang
Format: Article
Language:English
Published: Nature Portfolio 2024-12-01
Series:Scientific Reports
Subjects: Knowledge distillation; Uncertainty learning
Online Access:https://doi.org/10.1038/s41598-024-82647-6
author Zhen Guo
Dong Wang
Qiang He
Pengzhou Zhang
collection DOAJ
description Abstract Knowledge distillation improves student model performance. However, using a larger teacher model does not necessarily result in better distillation gains due to significant architecture and output gaps with smaller student networks. To address this issue, we reconsider teacher outputs and find that categories with strong teacher confidence benefit distillation more, while those with weaker certainty contribute less. Thus, we propose Logits Uncertainty Distillation (LUD) to bridge this gap. We introduce category uncertainty weighting to consider the uncertainty in the teacher model’s predictions. A confidence threshold, based on the teacher’s predictions, helps construct a mask that discounts uncertain classes during distillation. Furthermore, we incorporate two Spearman correlation loss functions to align the logits of the teacher and student models. These loss functions measure the discrepancy between the models’ outputs at the category and sample levels. We also introduce adaptive dynamic temperature factors to optimize the distillation process. By combining these techniques, we enhance knowledge distillation results and facilitate effective knowledge transfer between teacher and student models, even when architectural differences exist. Extensive experiments on multiple datasets demonstrate the effectiveness of our method.
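The Spearman-correlation alignment the description mentions can be sketched as a diagnostic: rank agreement between teacher and student logits, computed per sample over categories and per category over samples. This is an illustrative reimplementation under assumed names; the paper's actual loss would additionally need a differentiable (soft) ranking to be trainable.

```python
import math

def ranks(xs):
    # Average ranks (1 = smallest), with ties sharing their mean rank.
    sorted_xs = sorted(xs)
    return [
        sum(i + 1 for i, v in enumerate(sorted_xs) if v == x) / sorted_xs.count(x)
        for x in xs
    ]

def spearman(a, b):
    # Spearman's rho: Pearson correlation of the two rank vectors.
    ra, rb = ranks(a), ranks(b)
    ma, mb = sum(ra) / len(ra), sum(rb) / len(rb)
    num = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    den = math.sqrt(sum((x - ma) ** 2 for x in ra) * sum((y - mb) ** 2 for y in rb))
    return num / den

def rank_alignment_losses(teacher, student):
    """teacher/student: N x C logit matrices (lists of lists).
    Returns (category_level, sample_level) losses of the form 1 - rho
    (hypothetical shape of the two alignment terms)."""
    n, c = len(teacher), len(teacher[0])
    # Category level: rank agreement across classes, per sample.
    cat = sum(1 - spearman(teacher[i], student[i]) for i in range(n)) / n
    # Sample level: rank agreement across samples, per class.
    col = lambda m, j: [m[i][j] for i in range(n)]
    samp = sum(1 - spearman(col(teacher, j), col(student, j)) for j in range(c)) / c
    return cat, samp
```

When the student's logits preserve the teacher's rankings exactly, both terms are zero, which is the intuition behind using rank correlation rather than exact logit matching across architecture gaps.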
format Article
id doaj-art-abf702d9977e4c758b0f92f366d01863
institution DOAJ
issn 2045-2322
language English
publishDate 2024-12-01
publisher Nature Portfolio
record_format Article
series Scientific Reports
affiliation Communication University of China, State Key Laboratory of Media Convergence and Communication (Zhen Guo, Dong Wang, Qiang He, Pengzhou Zhang)
title Leveraging logit uncertainty for better knowledge distillation
topic Knowledge distillation
Uncertainty learning
url https://doi.org/10.1038/s41598-024-82647-6