Mixture of prompts learning for vision-language models

Bibliographic Details
Main Authors: Yu Du, Tong Niu, Rong Zhao
Format: Article
Language:English
Published: Frontiers Media S.A. 2025-06-01
Series:Frontiers in Artificial Intelligence
Subjects: prompt learning; vision-language model; mixture-of-experts; multi-modal; few-shot classification
Online Access:https://www.frontiersin.org/articles/10.3389/frai.2025.1580973/full
author Yu Du
Tong Niu
Rong Zhao
author_facet Yu Du
Tong Niu
Rong Zhao
author_sort Yu Du
collection DOAJ
description As powerful pre-trained vision-language models (VLMs) such as CLIP gain prominence, numerous studies have sought to adapt VLMs to downstream tasks. Among these approaches, prompt learning has been validated as an effective way to adapt to new tasks while tuning only a small number of parameters. However, current prompt learning methods face two challenges: first, a single soft prompt struggles to capture the diverse styles and patterns within a dataset; second, fine-tuning soft prompts is prone to overfitting. To address these challenges, we propose a mixture-of-prompts learning method that incorporates a routing module. This module captures a dataset's varied styles and dynamically selects the most suitable prompts for each instance. Additionally, we introduce a novel gating mechanism that ensures the router selects prompts based on their similarity to hard prompt templates, which both retains knowledge from hard prompts and improves selection accuracy. We also implement semantically grouped text-level supervision: each soft prompt is initialized with the token embeddings of the manually designed templates in its group, and a contrastive loss is applied between the resulting text features and the text features encoded from the corresponding hard prompts. This supervision keeps the text features derived from soft prompts close to those from their corresponding hard prompts, preserving initial knowledge and mitigating overfitting. Our method has been validated on 11 datasets, demonstrating clear improvements over existing baselines in few-shot learning, domain generalization, and base-to-new generalization scenarios. Our results establish that multi-prompt specialization with knowledge-preserving routing effectively bridges the adaptability-generalization tradeoff in VLM deployment. The code will be available at https://github.com/dyabel/mocoop.
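The description above outlines two ideas: an instance-conditioned router that picks among several soft prompts, and a contrastive regularizer that keeps soft-prompt text features close to hard-prompt text features. The sketch below illustrates these ideas in PyTorch; it is not the authors' released implementation (see the GitHub link above), and all names, shapes, the top-k routing, and the temperature are illustrative assumptions.

```python
# Illustrative sketch only: an instance-conditioned mixture of soft prompts
# plus a contrastive pull toward hard-prompt text features. All module and
# function names, shapes, and hyperparameters are assumptions, not the
# authors' code (their implementation is linked in this record).
import torch
import torch.nn as nn
import torch.nn.functional as F


class MixtureOfPrompts(nn.Module):
    def __init__(self, hard_prompt_embeds: torch.Tensor, image_dim: int):
        """hard_prompt_embeds: (P, L, D) token embeddings of P manually
        designed template groups, used to initialize the P soft prompts."""
        super().__init__()
        self.soft_prompts = nn.Parameter(hard_prompt_embeds.clone())
        num_prompts = hard_prompt_embeds.shape[0]
        # Router maps an image feature to one score per soft prompt.
        self.router = nn.Linear(image_dim, num_prompts)

    def forward(self, image_features: torch.Tensor, top_k: int = 1):
        """image_features: (B, image_dim). Returns a (B, L, D) mixed prompt
        per instance and the full routing distribution."""
        weights = F.softmax(self.router(image_features), dim=-1)      # (B, P)
        top_w, top_idx = weights.topk(top_k, dim=-1)                  # (B, k)
        top_w = top_w / top_w.sum(dim=-1, keepdim=True)               # renormalize
        selected = self.soft_prompts[top_idx]                         # (B, k, L, D)
        mixed = (top_w[..., None, None] * selected).sum(dim=1)        # (B, L, D)
        return mixed, weights


def knowledge_preserving_loss(soft_text_feat, hard_text_feat, tau: float = 0.07):
    """Contrastive loss pulling each soft-prompt text feature toward the text
    feature of its corresponding hard prompt (positives on the diagonal)."""
    soft = F.normalize(soft_text_feat, dim=-1)   # (P, D)
    hard = F.normalize(hard_text_feat, dim=-1)   # (P, D)
    logits = soft @ hard.t() / tau               # (P, P)
    targets = torch.arange(soft.size(0), device=soft.device)
    return F.cross_entropy(logits, targets)
```

In practice the mixed prompt would be prepended to each class-name token sequence and passed through a frozen CLIP text encoder; the gating described in the abstract additionally scores prompts by their similarity to hard prompt templates, which this sketch omits for brevity.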
format Article
id doaj-art-3a274e4e41be42deabf11a148a44efa7
institution Kabale University
issn 2624-8212
language English
publishDate 2025-06-01
publisher Frontiers Media S.A.
record_format Article
series Frontiers in Artificial Intelligence
spelling Frontiers in Artificial Intelligence, vol. 8 (2025-06-01), article 1580973, Frontiers Media S.A., eISSN 2624-8212. DOI: 10.3389/frai.2025.1580973.
Author affiliations (all authors): Center for Brain-Inspired Computing Research (CBICR), Tsinghua University, Beijing, China; Optical Memory National Engineering Research Center, Tsinghua University, Beijing, China; Department of Precision Instrument, Tsinghua University, Beijing, China; IDG/McGovern Institute for Brain Research, Tsinghua University, Beijing, China; CETC Haikang Group-Brain Inspired Computing Joint Research Center, Beijing, China.
spellingShingle Yu Du
Tong Niu
Rong Zhao
Mixture of prompts learning for vision-language models
Frontiers in Artificial Intelligence
prompt learning
vision-language model
mixture-of-experts
multi-modal
few-shot classification
title Mixture of prompts learning for vision-language models
title_full Mixture of prompts learning for vision-language models
title_fullStr Mixture of prompts learning for vision-language models
title_full_unstemmed Mixture of prompts learning for vision-language models
title_short Mixture of prompts learning for vision-language models
title_sort mixture of prompts learning for vision language models
topic prompt learning
vision-language model
mixture-of-experts
multi-modal
few-shot classification
url https://www.frontiersin.org/articles/10.3389/frai.2025.1580973/full
work_keys_str_mv AT yudu mixtureofpromptslearningforvisionlanguagemodels
AT tongniu mixtureofpromptslearningforvisionlanguagemodels
AT rongzhao mixtureofpromptslearningforvisionlanguagemodels