Leveraging logit uncertainty for better knowledge distillation
Abstract Knowledge distillation improves student model performance. However, using a larger teacher model does not necessarily result in better distillation gains due to significant architecture and output gaps with smaller student networks. To address this issue, we reconsider teacher outputs and find that categories with strong teacher confidence benefit distillation more, while those with weaker certainty contribute less. Thus, we propose Logits Uncertainty Distillation (LUD) to bridge this gap. We introduce category uncertainty weighting to consider the uncertainty in the teacher model’s predictions. A confidence threshold, based on the teacher’s predictions, helps construct a mask that discounts uncertain classes during distillation. Furthermore, we incorporate two Spearman correlation loss functions to align the logits of the teacher and student models. These loss functions measure the discrepancy between the models’ outputs at the category and sample levels. We also introduce adaptive dynamic temperature factors to optimize the distillation process. By combining these techniques, we enhance knowledge distillation results and facilitate effective knowledge transfer between teacher and student models, even when architectural differences exist. Extensive experiments on multiple datasets demonstrate the effectiveness of our method.
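The abstract describes a confidence-threshold mask that discounts uncertain teacher classes during distillation. The paper's exact formulation is not reproduced in this record, so the following is only an illustrative NumPy sketch of that idea; `tau`, `conf_threshold`, and the masking rule are assumed choices for illustration, not the authors' values.

```python
import numpy as np

def softmax(z, t=1.0):
    """Temperature-scaled softmax along the last axis."""
    z = z / t
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def masked_kd_loss(teacher_logits, student_logits, tau=4.0, conf_threshold=0.05):
    """Uncertainty-masked KL distillation: classes whose teacher
    probability falls below a confidence threshold are zeroed out,
    so only confident teacher classes contribute to the loss."""
    p_t = softmax(teacher_logits, tau)
    p_s = softmax(student_logits, tau)
    mask = (p_t >= conf_threshold).astype(float)  # keep confident classes
    kl = p_t * (np.log(p_t + 1e-12) - np.log(p_s + 1e-12))
    return float((mask * kl).sum(axis=-1).mean() * tau**2)
```

In a real training loop this term would be computed with a differentiable framework; the sketch only shows the masking logic, where the KL contribution of low-confidence teacher classes is dropped before averaging over the batch.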
Saved in:
| Main Authors: | Zhen Guo, Dong Wang, Qiang He, Pengzhou Zhang |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | Nature Portfolio, 2024-12-01 |
| Series: | Scientific Reports |
| Subjects: | Knowledge distillation, Uncertainty learning |
| Online Access: | https://doi.org/10.1038/s41598-024-82647-6 |
| _version_ | 1850103054731313152 |
|---|---|
| author | Zhen Guo; Dong Wang; Qiang He; Pengzhou Zhang |
| author_facet | Zhen Guo; Dong Wang; Qiang He; Pengzhou Zhang |
| author_sort | Zhen Guo |
| collection | DOAJ |
| description | Abstract Knowledge distillation improves student model performance. However, using a larger teacher model does not necessarily result in better distillation gains due to significant architecture and output gaps with smaller student networks. To address this issue, we reconsider teacher outputs and find that categories with strong teacher confidence benefit distillation more, while those with weaker certainty contribute less. Thus, we propose Logits Uncertainty Distillation (LUD) to bridge this gap. We introduce category uncertainty weighting to consider the uncertainty in the teacher model’s predictions. A confidence threshold, based on the teacher’s predictions, helps construct a mask that discounts uncertain classes during distillation. Furthermore, we incorporate two Spearman correlation loss functions to align the logits of the teacher and student models. These loss functions measure the discrepancy between the models’ outputs at the category and sample levels. We also introduce adaptive dynamic temperature factors to optimize the distillation process. By combining these techniques, we enhance knowledge distillation results and facilitate effective knowledge transfer between teacher and student models, even when architectural differences exist. Extensive experiments on multiple datasets demonstrate the effectiveness of our method. |
| format | Article |
| id | doaj-art-abf702d9977e4c758b0f92f366d01863 |
| institution | DOAJ |
| issn | 2045-2322 |
| language | English |
| publishDate | 2024-12-01 |
| publisher | Nature Portfolio |
| record_format | Article |
| series | Scientific Reports |
| spelling | Zhen Guo; Dong Wang; Qiang He; Pengzhou Zhang (Communication University of China, State Key Laboratory of Media Convergence and Communication). Leveraging logit uncertainty for better knowledge distillation. Scientific Reports, Nature Portfolio, 2024-12-01. ISSN 2045-2322. https://doi.org/10.1038/s41598-024-82647-6. Subjects: Knowledge distillation; Uncertainty learning |
| spellingShingle | Zhen Guo; Dong Wang; Qiang He; Pengzhou Zhang; Leveraging logit uncertainty for better knowledge distillation; Scientific Reports; Knowledge distillation; Uncertainty learning |
| title | Leveraging logit uncertainty for better knowledge distillation |
| title_full | Leveraging logit uncertainty for better knowledge distillation |
| title_fullStr | Leveraging logit uncertainty for better knowledge distillation |
| title_full_unstemmed | Leveraging logit uncertainty for better knowledge distillation |
| title_short | Leveraging logit uncertainty for better knowledge distillation |
| title_sort | leveraging logit uncertainty for better knowledge distillation |
| topic | Knowledge distillation; Uncertainty learning |
| url | https://doi.org/10.1038/s41598-024-82647-6 |
| work_keys_str_mv | AT zhenguo leveraginglogituncertaintyforbetterknowledgedistillation AT dongwang leveraginglogituncertaintyforbetterknowledgedistillation AT qianghe leveraginglogituncertaintyforbetterknowledgedistillation AT pengzhouzhang leveraginglogituncertaintyforbetterknowledgedistillation |
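The abstract also mentions two Spearman correlation losses that align teacher and student logits at the sample and category levels. As a hedged illustration (not the authors' code; the function name, the tie-breaking integer ranks, and the `1 - rho` form are assumptions), the core quantity can be sketched in NumPy as:

```python
import numpy as np

def _ranks(x):
    """Integer ranks along the last axis (ties broken by position)."""
    return np.argsort(np.argsort(x, axis=-1), axis=-1).astype(float)

def spearman_loss(teacher_logits, student_logits):
    """Mean of 1 - Spearman correlation between corresponding rows of
    teacher and student logits. Applied to a (batch, classes) matrix
    this gives a sample-level loss; applied to its transpose, a
    category-level loss across the batch."""
    rt, rs = _ranks(teacher_logits), _ranks(student_logits)
    rt -= rt.mean(axis=-1, keepdims=True)
    rs -= rs.mean(axis=-1, keepdims=True)
    num = (rt * rs).sum(axis=-1)
    den = np.sqrt((rt**2).sum(axis=-1) * (rs**2).sum(axis=-1)) + 1e-12
    return float((1.0 - num / den).mean())
```

True Spearman correlation assigns average ranks to ties, and gradient-based training would require a differentiable soft-ranking surrogate; this sketch only computes the loss value itself, e.g. `spearman_loss(T, S)` for the sample level and `spearman_loss(T.T, S.T)` for the category level.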