Autocorrelation Matrix Knowledge Distillation: A Task-Specific Distillation Method for BERT Models

Pre-trained language models perform well on a wide range of natural language processing tasks. However, their large number of parameters poses significant challenges for edge devices with limited resources, greatly limiting their use in practical deployments. This paper introduces a simple and efficient method called Autocorrelation Matrix Knowledge Distillation (AMKD), aimed at improving the performance of smaller BERT models on specific tasks and making them more suitable for practical deployment. The AMKD method uses the autocorrelation matrix to capture the relationships between features, enabling the student model to learn from the teacher not only the individual feature representations but also the correlations among them. Additionally, it addresses the dimensional mismatch between the hidden states of the student and teacher models: even when the student's hidden dimensions are smaller than the teacher's, AMKD retains the essential features of the teacher model, thereby minimizing information loss. Experimental results demonstrate that BERT<sub>TINY</sub>-AMKD outperforms traditional distillation methods and baseline models, achieving an average score of 83.6% on the GLUE tasks. This represents a 4.1% improvement over BERT<sub>TINY</sub>-KD and exceeds the performance of BERT<sub>4</sub>-PKD and DistilBERT<sub>4</sub> by 2.6% and 3.9%, respectively. Moreover, despite having only 13.3% of the parameters of BERT<sub>BASE</sub>, the BERT<sub>TINY</sub>-AMKD model retains over 96.3% of the performance of its teacher, BERT<sub>BASE</sub>.
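
The abstract does not give the paper's exact distillation loss, so the following is only a minimal sketch of the general idea, under stated assumptions: the autocorrelation matrix is taken over the token dimension (H Hᵀ), which yields a seq_len × seq_len matrix for both models and therefore sidesteps the teacher/student hidden-size mismatch; the function names (autocorrelation, amkd_loss), the normalization step, and the model sizes are hypothetical, not taken from the paper.

import torch
import torch.nn.functional as F

def autocorrelation(hidden: torch.Tensor) -> torch.Tensor:
    """Token-level autocorrelation: (batch, seq_len, dim) -> (batch, seq_len, seq_len)."""
    h = F.normalize(hidden, dim=-1)   # unit-normalize each token vector (assumed, for scale invariance)
    return h @ h.transpose(-1, -2)    # pairwise token-token correlations

def amkd_loss(student_hidden: torch.Tensor, teacher_hidden: torch.Tensor) -> torch.Tensor:
    """MSE between student and teacher autocorrelation matrices (illustrative AMKD-style loss)."""
    return F.mse_loss(autocorrelation(student_hidden), autocorrelation(teacher_hidden))

# Toy check with hypothetical sizes: teacher hidden size 768 (BERT-BASE-like), student 128.
if __name__ == "__main__":
    batch, seq_len = 2, 16
    teacher_h = torch.randn(batch, seq_len, 768)
    student_h = torch.randn(batch, seq_len, 128)
    print(amkd_loss(student_h, teacher_h).item())   # works despite the 768-vs-128 mismatch

Because both matrices are seq_len × seq_len, no projection layer is needed even when the student's hidden size is much smaller; whether the paper uses this token-level formulation or a feature-level one with its own mismatch handling cannot be determined from the abstract alone.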

Bibliographic Details
Main Authors: Kai Zhang, Jinqiu Li, Bingqian Wang, Haoran Meng
Format: Article
Language: English
Published: MDPI AG, 2024-10-01
Series: Applied Sciences
Subjects: BERT, knowledge distillation, model compression
Online Access: https://www.mdpi.com/2076-3417/14/20/9180
_version_ 1850205957157552128
author Kai Zhang
Jinqiu Li
Bingqian Wang
Haoran Meng
author_facet Kai Zhang
Jinqiu Li
Bingqian Wang
Haoran Meng
author_sort Kai Zhang
collection DOAJ
description Pre-trained language models perform well on a wide range of natural language processing tasks. However, their large number of parameters poses significant challenges for edge devices with limited resources, greatly limiting their use in practical deployments. This paper introduces a simple and efficient method called Autocorrelation Matrix Knowledge Distillation (AMKD), aimed at improving the performance of smaller BERT models on specific tasks and making them more suitable for practical deployment. The AMKD method uses the autocorrelation matrix to capture the relationships between features, enabling the student model to learn from the teacher not only the individual feature representations but also the correlations among them. Additionally, it addresses the dimensional mismatch between the hidden states of the student and teacher models: even when the student's hidden dimensions are smaller than the teacher's, AMKD retains the essential features of the teacher model, thereby minimizing information loss. Experimental results demonstrate that BERT<sub>TINY</sub>-AMKD outperforms traditional distillation methods and baseline models, achieving an average score of 83.6% on the GLUE tasks. This represents a 4.1% improvement over BERT<sub>TINY</sub>-KD and exceeds the performance of BERT<sub>4</sub>-PKD and DistilBERT<sub>4</sub> by 2.6% and 3.9%, respectively. Moreover, despite having only 13.3% of the parameters of BERT<sub>BASE</sub>, the BERT<sub>TINY</sub>-AMKD model retains over 96.3% of the performance of its teacher, BERT<sub>BASE</sub>.
format Article
id doaj-art-8a3973395a7b40d4bf6bd4d730384f20
institution OA Journals
issn 2076-3417
language English
publishDate 2024-10-01
publisher MDPI AG
record_format Article
series Applied Sciences
spelling doaj-art-8a3973395a7b40d4bf6bd4d730384f202025-08-20T02:10:57ZengMDPI AGApplied Sciences2076-34172024-10-011420918010.3390/app14209180Autocorrelation Matrix Knowledge Distillation: A Task-Specific Distillation Method for BERT ModelsKai Zhang0Jinqiu Li1Bingqian Wang2Haoran Meng3Department of Electronic Information Engineering, School of Physics and Electronic Information, Henan Polytechnic University, Wenyuan Street, Jiaozuo 454099, ChinaDepartment of Electronic Information Engineering, School of Physics and Electronic Information, Henan Polytechnic University, Wenyuan Street, Jiaozuo 454099, ChinaDepartment of Electronic Information Engineering, School of Physics and Electronic Information, Henan Polytechnic University, Wenyuan Street, Jiaozuo 454099, ChinaDepartment of Electronic Information Engineering, School of Physics and Electronic Information, Henan Polytechnic University, Wenyuan Street, Jiaozuo 454099, ChinaPre-trained language models perform well in various natural language processing tasks. However, their large number of parameters poses significant challenges for edge devices with limited resources, greatly limiting their application in practical deployment. This paper introduces a simple and efficient method called Autocorrelation Matrix Knowledge Distillation (AMKD), aimed at improving the performance of smaller BERT models for specific tasks and making them more applicable in practical deployment scenarios. The AMKD method effectively captures the relationships between features using the autocorrelation matrix, enabling the student model to learn not only the performance of individual features from the teacher model but also the correlations among these features. Additionally, it addresses the issue of dimensional mismatch between the hidden states of the student and teacher models. Even in cases where the dimensions are smaller, AMKD retains the essential features from the teacher model, thereby minimizing information loss. Experimental results demonstrate that BERT<sub>TINY</sub>-AMKD outperforms traditional distillation methods and baseline models, achieving an average score of 83.6% on GLUE tasks. This represents a 4.1% improvement over BERT<sub>TINY</sub>-KD and exceeds the performance of BERT<sub>4</sub>-PKD and DistilBERT<sub>4</sub> by 2.6% and 3.9%, respectively. Moreover, despite having only 13.3% of the parameters of BERT<sub>BASE</sub>, the BERT<sub>TINY</sub>-AMKD model retains over 96.3% of the performance of the teacher model, BERT<sub>BASE</sub>.https://www.mdpi.com/2076-3417/14/20/9180BERTknowledge distillationmodel compression
spellingShingle Kai Zhang
Jinqiu Li
Bingqian Wang
Haoran Meng
Autocorrelation Matrix Knowledge Distillation: A Task-Specific Distillation Method for BERT Models
Applied Sciences
BERT
knowledge distillation
model compression
title Autocorrelation Matrix Knowledge Distillation: A Task-Specific Distillation Method for BERT Models
title_full Autocorrelation Matrix Knowledge Distillation: A Task-Specific Distillation Method for BERT Models
title_fullStr Autocorrelation Matrix Knowledge Distillation: A Task-Specific Distillation Method for BERT Models
title_full_unstemmed Autocorrelation Matrix Knowledge Distillation: A Task-Specific Distillation Method for BERT Models
title_short Autocorrelation Matrix Knowledge Distillation: A Task-Specific Distillation Method for BERT Models
title_sort autocorrelation matrix knowledge distillation a task specific distillation method for bert models
topic BERT
knowledge distillation
model compression
url https://www.mdpi.com/2076-3417/14/20/9180
work_keys_str_mv AT kaizhang autocorrelationmatrixknowledgedistillationataskspecificdistillationmethodforbertmodels
AT jinqiuli autocorrelationmatrixknowledgedistillationataskspecificdistillationmethodforbertmodels
AT bingqianwang autocorrelationmatrixknowledgedistillationataskspecificdistillationmethodforbertmodels
AT haoranmeng autocorrelationmatrixknowledgedistillationataskspecificdistillationmethodforbertmodels