Leveraging logit uncertainty for better knowledge distillation
Abstract Knowledge distillation improves student model performance. However, using a larger teacher model does not necessarily result in better distillation gains due to significant architecture and output gaps with smaller student networks. To address this issue, we reconsider teacher outputs and find that categories with strong teacher confidence benefit distillation more, while those with weaker certainty contribute less. Thus, we propose Logits Uncertainty Distillation (LUD) to bridge this gap. We introduce category uncertainty weighting to consider the uncertainty in the teacher model’s predictions. A confidence threshold, based on the teacher’s predictions, helps construct a mask that discounts uncertain classes during distillation. Furthermore, we incorporate two Spearman correlation loss functions to align the logits of the teacher and student models. These loss functions measure the discrepancy between the models’ outputs at the category and sample levels. We also introduce adaptive dynamic temperature factors to optimize the distillation process. By combining these techniques, we enhance knowledge distillation results and facilitate effective knowledge transfer between teacher and student models, even when architectural differences exist. Extensive experiments on multiple datasets demonstrate the effectiveness of our method.
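The abstract describes a confidence-threshold mask that discounts uncertain teacher classes during distillation. The paper's exact formulation is not reproduced in this record, so the following is only an illustrative NumPy sketch of that idea; `tau`, `conf_threshold`, and the masking rule are assumed choices for illustration, not the authors' values.

```python
import numpy as np

def softmax(z, t=1.0):
    """Temperature-scaled softmax along the last axis."""
    z = z / t
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def masked_kd_loss(teacher_logits, student_logits, tau=4.0, conf_threshold=0.05):
    """Uncertainty-masked KL distillation: classes whose teacher
    probability falls below a confidence threshold are zeroed out,
    so only confident teacher classes contribute to the loss."""
    p_t = softmax(teacher_logits, tau)
    p_s = softmax(student_logits, tau)
    mask = (p_t >= conf_threshold).astype(float)  # keep confident classes
    kl = p_t * (np.log(p_t + 1e-12) - np.log(p_s + 1e-12))
    return float((mask * kl).sum(axis=-1).mean() * tau**2)
```

In a real training loop this term would be computed with a differentiable framework; the sketch only shows the masking logic, where the KL contribution of low-confidence teacher classes is dropped before averaging over the batch.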
Saved in:
| Main Authors: | Zhen Guo, Dong Wang, Qiang He, Pengzhou Zhang |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | Nature Portfolio, 2024-12-01 |
| Series: | Scientific Reports |
| Subjects: | Knowledge distillation, Uncertainty learning |
| Online Access: | https://doi.org/10.1038/s41598-024-82647-6 |
| _version_ | 1850103054731313152 |
|---|---|
| author | Zhen Guo; Dong Wang; Qiang He; Pengzhou Zhang |
| author_facet | Zhen Guo; Dong Wang; Qiang He; Pengzhou Zhang |
| author_sort | Zhen Guo |
| collection | DOAJ |
| description | Abstract Knowledge distillation improves student model performance. However, using a larger teacher model does not necessarily result in better distillation gains due to significant architecture and output gaps with smaller student networks. To address this issue, we reconsider teacher outputs and find that categories with strong teacher confidence benefit distillation more, while those with weaker certainty contribute less. Thus, we propose Logits Uncertainty Distillation (LUD) to bridge this gap. We introduce category uncertainty weighting to consider the uncertainty in the teacher model’s predictions. A confidence threshold, based on the teacher’s predictions, helps construct a mask that discounts uncertain classes during distillation. Furthermore, we incorporate two Spearman correlation loss functions to align the logits of the teacher and student models. These loss functions measure the discrepancy between the models’ outputs at the category and sample levels. We also introduce adaptive dynamic temperature factors to optimize the distillation process. By combining these techniques, we enhance knowledge distillation results and facilitate effective knowledge transfer between teacher and student models, even when architectural differences exist. Extensive experiments on multiple datasets demonstrate the effectiveness of our method. |
| format | Article |
| id | doaj-art-abf702d9977e4c758b0f92f366d01863 |
| institution | DOAJ |
| issn | 2045-2322 |
| language | English |
| publishDate | 2024-12-01 |
| publisher | Nature Portfolio |
| record_format | Article |
| series | Scientific Reports |
| spelling | Zhen Guo; Dong Wang; Qiang He; Pengzhou Zhang (Communication University of China, State Key Laboratory of Media Convergence and Communication). Leveraging logit uncertainty for better knowledge distillation. Scientific Reports, Nature Portfolio, 2024-12-01. ISSN 2045-2322. https://doi.org/10.1038/s41598-024-82647-6. Subjects: Knowledge distillation; Uncertainty learning |
| spellingShingle | Zhen Guo; Dong Wang; Qiang He; Pengzhou Zhang; Leveraging logit uncertainty for better knowledge distillation; Scientific Reports; Knowledge distillation; Uncertainty learning |
| title | Leveraging logit uncertainty for better knowledge distillation |
| title_full | Leveraging logit uncertainty for better knowledge distillation |
| title_fullStr | Leveraging logit uncertainty for better knowledge distillation |
| title_full_unstemmed | Leveraging logit uncertainty for better knowledge distillation |
| title_short | Leveraging logit uncertainty for better knowledge distillation |
| title_sort | leveraging logit uncertainty for better knowledge distillation |
| topic | Knowledge distillation; Uncertainty learning |
| url | https://doi.org/10.1038/s41598-024-82647-6 |
| work_keys_str_mv | AT zhenguo leveraginglogituncertaintyforbetterknowledgedistillation AT dongwang leveraginglogituncertaintyforbetterknowledgedistillation AT qianghe leveraginglogituncertaintyforbetterknowledgedistillation AT pengzhouzhang leveraginglogituncertaintyforbetterknowledgedistillation |
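The abstract also mentions two Spearman correlation losses that align teacher and student logits at the sample and category levels. As a hedged illustration (not the authors' code; the function name, the tie-breaking integer ranks, and the `1 - rho` form are assumptions), the core quantity can be sketched in NumPy as:

```python
import numpy as np

def _ranks(x):
    """Integer ranks along the last axis (ties broken by position)."""
    return np.argsort(np.argsort(x, axis=-1), axis=-1).astype(float)

def spearman_loss(teacher_logits, student_logits):
    """Mean of 1 - Spearman correlation between corresponding rows of
    teacher and student logits. Applied to a (batch, classes) matrix
    this gives a sample-level loss; applied to its transpose, a
    category-level loss across the batch."""
    rt, rs = _ranks(teacher_logits), _ranks(student_logits)
    rt -= rt.mean(axis=-1, keepdims=True)
    rs -= rs.mean(axis=-1, keepdims=True)
    num = (rt * rs).sum(axis=-1)
    den = np.sqrt((rt**2).sum(axis=-1) * (rs**2).sum(axis=-1)) + 1e-12
    return float((1.0 - num / den).mean())
```

True Spearman correlation assigns average ranks to ties, and gradient-based training would require a differentiable soft-ranking surrogate; this sketch only computes the loss value itself, e.g. `spearman_loss(T, S)` for the sample level and `spearman_loss(T.T, S.T)` for the category level.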