Mixture of prompts learning for vision-language models
As powerful pre-trained vision-language models (VLMs) like CLIP gain prominence, numerous studies have attempted to adapt VLMs to downstream tasks. Among these, prompt learning has been validated as an effective method for adapting to new tasks while requiring only a small number of parameters. ...
| Main Authors: | Yu Du, Tong Niu, Rong Zhao |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | Frontiers Media S.A., 2025-06-01 |
| Series: | Frontiers in Artificial Intelligence |
| Subjects: | prompt learning; vision-language model; mixture-of-experts; multi-modal; few-shot classification |
| Online Access: | https://www.frontiersin.org/articles/10.3389/frai.2025.1580973/full |
| _version_ | 1849336635634745344 |
|---|---|
| author | Yu Du, Tong Niu, Rong Zhao |
| author_facet | Yu Du, Tong Niu, Rong Zhao |
| author_sort | Yu Du |
| collection | DOAJ |
| description | As powerful pre-trained vision-language models (VLMs) like CLIP gain prominence, numerous studies have attempted to adapt VLMs to downstream tasks. Among these, prompt learning has been validated as an effective method for adapting to new tasks while requiring only a small number of parameters. However, current prompt learning methods face two challenges: first, a single soft prompt struggles to capture the diverse styles and patterns within a dataset; second, fine-tuning soft prompts is prone to overfitting. To address these challenges, we propose a mixture-of-prompts learning method incorporating a routing module. This module captures a dataset's varied styles and dynamically selects the most suitable prompts for each instance. Additionally, we introduce a novel gating mechanism that ensures the router selects prompts based on their similarity to hard prompt templates, which both retains knowledge from the hard prompts and improves selection accuracy. We also implement semantically grouped text-level supervision, initializing each soft prompt with the token embeddings of manually designed templates from its group and applying a contrastive loss between the resulting text features and the hard-prompt-encoded text features. This supervision ensures that the text features derived from soft prompts remain close to those from their corresponding hard prompts, preserving initial knowledge and mitigating overfitting. Our method has been validated on 11 datasets, demonstrating clear improvements in few-shot learning, domain generalization, and base-to-new generalization scenarios compared to existing baselines. Our approach establishes that multi-prompt specialization with knowledge-preserving routing effectively bridges the adaptability-generalization tradeoff in VLM deployment. The code will be available at https://github.com/dyabel/mocoop. (An illustrative code sketch follows this record.) |
| format | Article |
| id | doaj-art-3a274e4e41be42deabf11a148a44efa7 |
| institution | Kabale University |
| issn | 2624-8212 |
| language | English |
| publishDate | 2025-06-01 |
| publisher | Frontiers Media S.A. |
| record_format | Article |
| series | Frontiers in Artificial Intelligence |
| spelling | doaj-art-3a274e4e41be42deabf11a148a44efa7 2025-08-20T03:44:55Z eng Frontiers Media S.A. Frontiers in Artificial Intelligence 2624-8212 2025-06-01 vol. 8 10.3389/frai.2025.1580973 1580973 Mixture of prompts learning for vision-language models. Yu Du, Tong Niu, Rong Zhao; each author affiliated with: Center for Brain-Inspired Computing Research (CBICR), Tsinghua University, Beijing, China; Optical Memory National Engineering Research Center, Tsinghua University, Beijing, China; Department of Precision Instrument, Tsinghua University, Beijing, China; IDG/McGovern Institute for Brain Research, Tsinghua University, Beijing, China; CETC Haikang Group-Brain Inspired Computing Joint Research Center, Beijing, China. Abstract as in the description field above. The code will be available at https://github.com/dyabel/mocoop. https://www.frontiersin.org/articles/10.3389/frai.2025.1580973/full Keywords: prompt learning; vision-language model; mixture-of-experts; multi-modal; few-shot classification |
| spellingShingle | Yu Du; Tong Niu; Rong Zhao; Mixture of prompts learning for vision-language models; Frontiers in Artificial Intelligence; prompt learning; vision-language model; mixture-of-experts; multi-modal; few-shot classification |
| title | Mixture of prompts learning for vision-language models |
| title_full | Mixture of prompts learning for vision-language models |
| title_fullStr | Mixture of prompts learning for vision-language models |
| title_full_unstemmed | Mixture of prompts learning for vision-language models |
| title_short | Mixture of prompts learning for vision-language models |
| title_sort | mixture of prompts learning for vision language models |
| topic | prompt learning; vision-language model; mixture-of-experts; multi-modal; few-shot classification |
| url | https://www.frontiersin.org/articles/10.3389/frai.2025.1580973/full |
| work_keys_str_mv | AT yudu mixtureofpromptslearningforvisionlanguagemodels AT tongniu mixtureofpromptslearningforvisionlanguagemodels AT rongzhao mixtureofpromptslearningforvisionlanguagemodels |
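The description above outlines the method's key mechanisms: an instance-conditioned router over a pool of soft prompts, gating anchored to hard prompt templates, and text-level supervision that keeps soft-prompt text features close to hard-prompt text features. As promised there, here is a minimal, hedged PyTorch sketch of that design. It is not the authors' released implementation (see https://github.com/dyabel/mocoop for that): the `encode_text` callable, the tensor shapes, the `hard_text_feats` buffer, and the cosine regularizer standing in for the paper's contrastive text-level loss are all illustrative assumptions, and class-name tokens are omitted for brevity.

```python
# Hedged sketch of mixture-of-prompts routing with knowledge-preserving
# supervision. Encoders and feature shapes are placeholders for a CLIP-style
# backbone; this is an illustration of the idea, not the paper's code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MixtureOfPrompts(nn.Module):
    def __init__(self, hard_ctx_init, hard_text_feats, top_k=2):
        """
        hard_ctx_init   (n_prompts, n_ctx, dim) token embeddings of the
                        hand-crafted templates, one group per soft prompt
        hard_text_feats (n_prompts, feat_dim) frozen text features of the
                        hard prompts, used by the router and as targets
        """
        super().__init__()
        # Each soft prompt starts from its group's hard-template embeddings,
        # so optimization begins from hand-crafted knowledge rather than noise.
        self.soft_prompts = nn.Parameter(hard_ctx_init.clone())
        self.register_buffer("hard_text_feats",
                             F.normalize(hard_text_feats, dim=-1))
        self.top_k = top_k

    def route(self, image_feats):
        # Gate each instance to the prompts whose frozen hard-template features
        # are most similar to its image feature, keeping routing anchored to
        # hard-prompt knowledge instead of a free-form learned gate.
        sims = F.normalize(image_feats, dim=-1) @ self.hard_text_feats.t()  # (B, P)
        weights, idx = sims.topk(self.top_k, dim=-1)                        # (B, k)
        return weights.softmax(dim=-1), idx

    def forward(self, image_feats, encode_text):
        """
        encode_text: callable mapping (n_prompts, n_ctx, dim) soft-prompt token
        embeddings to (n_prompts, feat_dim) text features (a stand-in for
        running the prompts through CLIP's text encoder).
        """
        weights, idx = self.route(image_feats)
        soft_feats = F.normalize(encode_text(self.soft_prompts), dim=-1)    # (P, D)
        # Instance-specific text feature: weighted mix of the selected prompts.
        mixed = (weights.unsqueeze(-1) * soft_feats[idx]).sum(dim=1)        # (B, D)
        # Text-level regularizer: keep each soft prompt's feature close to its
        # group's hard-prompt feature, preserving knowledge and curbing overfitting.
        reg = 1.0 - (soft_feats * self.hard_text_feats).sum(dim=-1).mean()
        return F.normalize(mixed, dim=-1), reg
```

One plausible reading of the gating mechanism, used above, routes each image to the prompts whose hard-template text features it is most similar to, so selection stays tied to hand-crafted knowledge; initializing `soft_prompts` from the template embeddings and penalizing drift from `hard_text_feats` mirrors the knowledge-preserving, semantically grouped supervision described in the abstract.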