M3AE-Distill: An Efficient Distilled Model for Medical Vision–Language Downstream Tasks
The multi-modal masked autoencoder (M3AE) is a widely studied medical vision–language (VL) model that can be applied to various clinical tasks. However, its large parameter count poses challenges for deployment in real-world settings. Knowledge distillation (KD) has proven effective for compressing task-specific uni-modal models, yet its application to medical VL backbone models during pre-training remains underexplored. To address this, M3AE-Distill, a lightweight medical VL model, is proposed to uphold high performance while improving efficiency. During pre-training, two key strategies are developed: (1) both hidden-state and attention-map distillation are employed to guide the student model, and (2) an attention-guided masking strategy is designed to enhance fine-grained image–text alignment. Extensive experiments on five medical VL datasets across three tasks validate the effectiveness of M3AE-Distill. Two student variants, M3AE-Distill-Small and M3AE-Distill-Base, are provided to support a flexible trade-off between efficiency and accuracy. M3AE-Distill-Base consistently outperforms existing models and achieves performance comparable to the teacher model, while delivering 2.11× and 2.61× speedups during inference and fine-tuning, respectively.
| Main Authors: | Xudong Liang, Jiang Xie, Mengfei Zhang, Zhuo Bi |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | MDPI AG, 2025-07-01 |
| Series: | Bioengineering |
| Subjects: | vision–language; deep learning; knowledge distillation; pre-training |
| Online Access: | https://www.mdpi.com/2306-5354/12/7/738 |
| DOI: | 10.3390/bioengineering12070738 |
| ISSN: | 2306-5354 |
| Volume/Issue: | Vol. 12, Issue 7, Article 738 |
| Author Affiliations: | Xudong Liang, Jiang Xie, Mengfei Zhang: School of Computer Engineering and Science, Shanghai University, Shanghai 200444, China; Zhuo Bi: School of Information Technology, Shanghai Jian Qiao University, Shanghai 201306, China |
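
The abstract names two pre-training strategies: hidden-state and attention-map distillation, and attention-guided masking. Below is a minimal PyTorch sketch of how such losses and masking could look; all function names, tensor shapes, loss weights, and the specific masking heuristic are assumptions for illustration, not the authors' released implementation.

```python
# Minimal sketch (PyTorch) of the two strategies named in the abstract.
# All names, shapes, and the masking heuristic are illustrative assumptions;
# it also assumes the student and teacher share token and head counts.
import torch
import torch.nn.functional as F


def hidden_state_loss(student_h: torch.Tensor,
                      teacher_h: torch.Tensor,
                      proj: torch.nn.Linear) -> torch.Tensor:
    """MSE between projected student hidden states and teacher hidden states.

    student_h: (B, N, d_student); teacher_h: (B, N, d_teacher).
    `proj` maps the smaller student width up to the teacher width.
    """
    return F.mse_loss(proj(student_h), teacher_h)


def attention_map_loss(student_attn: torch.Tensor,
                       teacher_attn: torch.Tensor) -> torch.Tensor:
    """KL divergence between teacher and student attention distributions.

    Both tensors: (B, heads, N, N) row-stochastic attention probabilities.
    F.kl_div expects the input in log space and the target as probabilities.
    """
    return F.kl_div(torch.log(student_attn + 1e-8),
                    teacher_attn, reduction="batchmean")


def attention_guided_mask(teacher_attn: torch.Tensor,
                          mask_ratio: float = 0.75) -> torch.Tensor:
    """Pick image patches to mask using the teacher's attention.

    One plausible reading of "attention-guided masking": score each patch
    by the attention it receives from the [CLS] token, averaged over heads,
    and mask the least-attended patches so salient regions stay visible
    for image-text alignment. Returns a (B, N-1) bool mask, True = masked.
    """
    cls_to_patch = teacher_attn[:, :, 0, 1:].mean(dim=1)   # (B, N-1)
    num_mask = int(cls_to_patch.size(1) * mask_ratio)
    mask_idx = cls_to_patch.argsort(dim=1)[:, :num_mask]   # lowest scores first
    mask = torch.zeros_like(cls_to_patch, dtype=torch.bool)
    return mask.scatter_(1, mask_idx, True)


# The overall objective would then combine the usual M3AE pre-training loss
# with the two distillation terms, e.g. (weights alpha, beta are assumptions):
#   loss = pretrain_loss \
#        + alpha * hidden_state_loss(s_h, t_h, proj) \
#        + beta * attention_map_loss(s_attn, t_attn)
```

The record does not say whether the paper masks the least- or most-attended patches; the sketch picks one option and labels it as such.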