Autocorrelation Matrix Knowledge Distillation: A Task-Specific Distillation Method for BERT Models
Pre-trained language models perform well across a wide range of natural language processing tasks. However, their large parameter counts pose significant challenges for resource-constrained edge devices, greatly limiting their practical deployment. This paper introduces a simple and efficient method called Autocorrelation Matrix Knowledge Distillation (AMKD), aimed at improving the performance of smaller BERT models on specific tasks and making them more suitable for practical deployment scenarios.
| Main Authors: | Kai Zhang, Jinqiu Li, Bingqian Wang, Haoran Meng |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | MDPI AG, 2024-10-01 |
| Series: | Applied Sciences |
| Subjects: | BERT; knowledge distillation; model compression |
| Online Access: | https://www.mdpi.com/2076-3417/14/20/9180 |
| _version_ | 1850205957157552128 |
|---|---|
| author | Kai Zhang, Jinqiu Li, Bingqian Wang, Haoran Meng |
| author_sort | Kai Zhang |
| collection | DOAJ |
| description | Pre-trained language models perform well across a wide range of natural language processing tasks. However, their large parameter counts pose significant challenges for resource-constrained edge devices, greatly limiting their practical deployment. This paper introduces a simple and efficient method called Autocorrelation Matrix Knowledge Distillation (AMKD), aimed at improving the performance of smaller BERT models on specific tasks and making them more suitable for practical deployment scenarios. AMKD uses the autocorrelation matrix to capture the relationships between features, enabling the student model to learn not only the behavior of individual features from the teacher model but also the correlations among those features. It also addresses the dimensional mismatch between the hidden states of the student and teacher models: even when the student's hidden dimensions are smaller, AMKD retains the essential features of the teacher model, minimizing information loss. Experimental results demonstrate that BERT<sub>TINY</sub>-AMKD outperforms traditional distillation methods and baseline models, achieving an average score of 83.6% on GLUE tasks. This represents a 4.1% improvement over BERT<sub>TINY</sub>-KD and exceeds the performance of BERT<sub>4</sub>-PKD and DistilBERT<sub>4</sub> by 2.6% and 3.9%, respectively. Moreover, despite having only 13.3% of the parameters of BERT<sub>BASE</sub>, BERT<sub>TINY</sub>-AMKD retains over 96.3% of the performance of its teacher, BERT<sub>BASE</sub>. |
| format | Article |
| id | doaj-art-8a3973395a7b40d4bf6bd4d730384f20 |
| institution | OA Journals |
| issn | 2076-3417 |
| language | English |
| publishDate | 2024-10-01 |
| publisher | MDPI AG |
| record_format | Article |
| series | Applied Sciences |
| doi | 10.3390/app14209180 |
| affiliation | Department of Electronic Information Engineering, School of Physics and Electronic Information, Henan Polytechnic University, Wenyuan Street, Jiaozuo 454099, China (all four authors) |
| title | Autocorrelation Matrix Knowledge Distillation: A Task-Specific Distillation Method for BERT Models |
| topic | BERT; knowledge distillation; model compression |
| url | https://www.mdpi.com/2076-3417/14/20/9180 |
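The distillation objective described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes the "autocorrelation matrix" is the token-by-token Gram matrix of a layer's (normalized) hidden states. Because that matrix has shape seq_len × seq_len regardless of hidden size, the student and teacher matrices are directly comparable even when their hidden dimensions differ, which is one plausible way the dimensional-mismatch issue the abstract mentions could be handled. The names `autocorrelation` and `amkd_loss` are hypothetical.

```python
import torch
import torch.nn.functional as F

def autocorrelation(h: torch.Tensor) -> torch.Tensor:
    """Token-token autocorrelation (Gram) matrix of hidden states.

    h: (batch, seq_len, dim). Returns (batch, seq_len, seq_len),
    independent of `dim`, so teacher and student matrices can be
    compared even when their hidden sizes differ.
    """
    h = F.normalize(h, dim=-1)        # unit-normalize each token vector
    return h @ h.transpose(-1, -2)    # pairwise cosine similarities

def amkd_loss(student_h: torch.Tensor, teacher_h: torch.Tensor) -> torch.Tensor:
    """MSE between student and teacher autocorrelation matrices."""
    return F.mse_loss(autocorrelation(student_h), autocorrelation(teacher_h))

# Toy example: teacher hidden size 768, student hidden size 312 --
# no projection layer is needed because both Gram matrices are 8x8.
t = torch.randn(2, 8, 768)
s = torch.randn(2, 8, 312)
loss = amkd_loss(s, t)  # scalar tensor
```

In practice such a term would be added to the task loss (and possibly a soft-label KD loss) when fine-tuning the student on a specific GLUE task, which matches the task-specific setting the title describes.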