M3AE-Distill: An Efficient Distilled Model for Medical Vision–Language Downstream Tasks

The multi-modal masked autoencoder (M3AE) is a widely studied medical vision–language (VL) model that can be applied to a variety of clinical tasks. However, its large parameter count poses challenges for deployment in real-world settings. Knowledge distillation (KD) has proven effective for compressing task-specific uni-modal models, yet its application to medical VL backbone models during pre-training remains underexplored. To address this, M3AE-Distill, a lightweight medical VL model, is proposed to maintain high performance while improving efficiency. During pre-training, two key strategies are developed: (1) both hidden-state and attention-map distillation are employed to guide the student model, and (2) an attention-guided masking strategy is designed to enhance fine-grained image–text alignment. Extensive experiments on five medical VL datasets across three tasks validate the effectiveness of M3AE-Distill. Two student variants, M3AE-Distill-Small and M3AE-Distill-Base, are provided to support a flexible trade-off between efficiency and accuracy. M3AE-Distill-Base consistently outperforms existing models and achieves performance comparable to the teacher model, while delivering 2.11× and 2.61× speedups during inference and fine-tuning, respectively.
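The first strategy pairs hidden-state distillation with attention-map distillation. Below is a minimal PyTorch sketch of such a joint feature-level loss; the module name FeatureDistillLoss, the uniform layer mapping, the linear projection for mismatched widths, and the weights alpha/beta are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureDistillLoss(nn.Module):
    """Hidden-state + attention-map distillation between a teacher and a
    smaller student transformer. Layer mapping, projection, and loss weights
    are illustrative assumptions, not the paper's exact recipe."""

    def __init__(self, student_dim: int, teacher_dim: int,
                 student_layers: int, teacher_layers: int,
                 alpha: float = 1.0, beta: float = 1.0):
        super().__init__()
        # Project student hidden states to the teacher's width when they differ.
        self.proj = (nn.Linear(student_dim, teacher_dim)
                     if student_dim != teacher_dim else nn.Identity())
        # Uniform layer mapping: student layer i mimics teacher layer map[i].
        stride = teacher_layers // student_layers
        self.layer_map = [stride * (i + 1) - 1 for i in range(student_layers)]
        self.alpha, self.beta = alpha, beta

    def forward(self, s_hidden, t_hidden, s_attn, t_attn):
        # s_hidden: list of (B, L, student_dim); t_hidden: list of (B, L, teacher_dim)
        # s_attn / t_attn: lists of (B, heads, L, L) attention probabilities
        h_loss, a_loss = 0.0, 0.0
        for i, j in enumerate(self.layer_map):
            h_loss = h_loss + F.mse_loss(self.proj(s_hidden[i]),
                                         t_hidden[j].detach())
            # Average over heads so head counts need not match.
            a_loss = a_loss + F.mse_loss(s_attn[i].mean(dim=1),
                                         t_attn[j].mean(dim=1).detach())
        n = len(self.layer_map)
        return self.alpha * h_loss / n + self.beta * a_loss / n
```

In a pre-training loop, a term like this would typically be added to the student's own masked image/language modeling objectives, with the teacher kept frozen.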

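The second strategy, attention-guided masking, biases which image patches are hidden toward regions the teacher attends to, with the aim of sharpening fine-grained image–text alignment. The sketch below is one plausible realization under stated assumptions: it uses [CLS]-to-patch attention as a saliency prior and splits the mask budget between top-ranked and random patches. The function name, the source of the attention, and the split ratio are hypothetical.

```python
import torch

def attention_guided_mask(attn_cls: torch.Tensor, mask_ratio: float = 0.75,
                          keep_top: float = 0.5) -> torch.Tensor:
    """Select patches to mask using teacher attention as a saliency prior.

    attn_cls: (B, N) attention from the [CLS] token to the N image patches,
    e.g. averaged over heads of the teacher's last layer.
    Returns a boolean mask of shape (B, N); True = patch is masked.
    """
    B, N = attn_cls.shape
    n_mask = int(N * mask_ratio)
    n_top = int(n_mask * keep_top)   # salient patches masked by attention rank
    n_rand = n_mask - n_top          # remainder of the budget masked at random

    mask = torch.zeros(B, N, dtype=torch.bool, device=attn_cls.device)
    top_idx = attn_cls.topk(n_top, dim=1).indices   # most-attended patches
    mask.scatter_(1, top_idx, True)

    # Randomly mask among the patches that are still visible.
    noise = torch.rand(B, N, device=attn_cls.device)
    noise.masked_fill_(mask, float('inf'))  # exclude already-masked patches
    rand_idx = noise.topk(n_rand, dim=1, largest=False).indices
    mask.scatter_(1, rand_idx, True)
    return mask
```
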
Bibliographic Details
Main Authors: Xudong Liang, Jiang Xie, Mengfei Zhang, Zhuo Bi
Affiliations: School of Computer Engineering and Science, Shanghai University, Shanghai 200444, China (Liang, Xie, Zhang); School of Information Technology, Shanghai Jian Qiao University, Shanghai 201306, China (Bi)
Format: Article
Language: English
Published: MDPI AG, 2025-07-01
Series: Bioengineering
ISSN: 2306-5354
DOI: 10.3390/bioengineering12070738
Subjects: vision–language; deep learning; knowledge distillation; pre-training
Online Access: https://www.mdpi.com/2306-5354/12/7/738