M3AE-Distill: An Efficient Distilled Model for Medical Vision–Language Downstream Tasks

The multi-modal masked autoencoder (M3AE) is a widely studied medical vision–language (VL) model that can be applied to a variety of clinical tasks. However, its large parameter count poses challenges for deployment in real-world settings. Knowledge distillation (KD) has proven effective for compressing task-specific uni-modal models, yet its application to medical VL backbone models during pre-training remains underexplored. To address this, M3AE-Distill, a lightweight medical VL model, is proposed to maintain high performance while improving efficiency. During pre-training, two key strategies are developed: (1) both hidden-state and attention-map distillation are employed to guide the student model, and (2) an attention-guided masking strategy is designed to enhance fine-grained image–text alignment. Extensive experiments on five medical VL datasets across three tasks validate the effectiveness of M3AE-Distill. Two student variants, M3AE-Distill-Small and M3AE-Distill-Base, are provided to support a flexible trade-off between efficiency and accuracy. M3AE-Distill-Base consistently outperforms existing models and achieves performance comparable to the teacher model, while delivering 2.11× and 2.61× speedups during inference and fine-tuning, respectively.
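The first strategy pairs hidden-state distillation with attention-map distillation. Below is a minimal PyTorch sketch of such a joint feature-level loss; the module name FeatureDistillLoss, the uniform layer mapping, the linear projection for mismatched widths, and the weights alpha/beta are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureDistillLoss(nn.Module):
    """Hidden-state + attention-map distillation between a teacher and a
    smaller student transformer. Layer mapping, projection, and loss weights
    are illustrative assumptions, not the paper's exact recipe."""

    def __init__(self, student_dim: int, teacher_dim: int,
                 student_layers: int, teacher_layers: int,
                 alpha: float = 1.0, beta: float = 1.0):
        super().__init__()
        # Project student hidden states to the teacher's width when they differ.
        self.proj = (nn.Linear(student_dim, teacher_dim)
                     if student_dim != teacher_dim else nn.Identity())
        # Uniform layer mapping: student layer i mimics teacher layer map[i].
        stride = teacher_layers // student_layers
        self.layer_map = [stride * (i + 1) - 1 for i in range(student_layers)]
        self.alpha, self.beta = alpha, beta

    def forward(self, s_hidden, t_hidden, s_attn, t_attn):
        # s_hidden: list of (B, L, student_dim); t_hidden: list of (B, L, teacher_dim)
        # s_attn / t_attn: lists of (B, heads, L, L) attention probabilities
        h_loss, a_loss = 0.0, 0.0
        for i, j in enumerate(self.layer_map):
            h_loss = h_loss + F.mse_loss(self.proj(s_hidden[i]),
                                         t_hidden[j].detach())
            # Average over heads so head counts need not match.
            a_loss = a_loss + F.mse_loss(s_attn[i].mean(dim=1),
                                         t_attn[j].mean(dim=1).detach())
        n = len(self.layer_map)
        return self.alpha * h_loss / n + self.beta * a_loss / n
```

In a pre-training loop, a term like this would typically be added to the student's own masked image/language modeling objectives, with the teacher kept frozen.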

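The second strategy, attention-guided masking, biases which image patches are hidden toward regions the teacher attends to, with the aim of sharpening fine-grained image–text alignment. The sketch below is one plausible realization under stated assumptions: it uses [CLS]-to-patch attention as a saliency prior and splits the mask budget between top-ranked and random patches. The function name, the source of the attention, and the split ratio are hypothetical.

```python
import torch

def attention_guided_mask(attn_cls: torch.Tensor, mask_ratio: float = 0.75,
                          keep_top: float = 0.5) -> torch.Tensor:
    """Select patches to mask using teacher attention as a saliency prior.

    attn_cls: (B, N) attention from the [CLS] token to the N image patches,
    e.g. averaged over heads of the teacher's last layer.
    Returns a boolean mask of shape (B, N); True = patch is masked.
    """
    B, N = attn_cls.shape
    n_mask = int(N * mask_ratio)
    n_top = int(n_mask * keep_top)   # salient patches masked by attention rank
    n_rand = n_mask - n_top          # remainder of the budget masked at random

    mask = torch.zeros(B, N, dtype=torch.bool, device=attn_cls.device)
    top_idx = attn_cls.topk(n_top, dim=1).indices   # most-attended patches
    mask.scatter_(1, top_idx, True)

    # Randomly mask among the patches that are still visible.
    noise = torch.rand(B, N, device=attn_cls.device)
    noise.masked_fill_(mask, float('inf'))  # exclude already-masked patches
    rand_idx = noise.topk(n_rand, dim=1, largest=False).indices
    mask.scatter_(1, rand_idx, True)
    return mask
```
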
Bibliographic Details
Main Authors: Xudong Liang, Jiang Xie, Mengfei Zhang, Zhuo Bi
Affiliations: School of Computer Engineering and Science, Shanghai University, Shanghai 200444, China (Liang, Xie, Zhang); School of Information Technology, Shanghai Jian Qiao University, Shanghai 201306, China (Bi)
Format: Article
Language: English
Published: MDPI AG, 2025-07-01
Series: Bioengineering
ISSN: 2306-5354
DOI: 10.3390/bioengineering12070738
Subjects: vision–language; deep learning; knowledge distillation; pre-training
Online Access: https://www.mdpi.com/2306-5354/12/7/738