MDAPT: Multi-Modal Depth Adversarial Prompt Tuning to Enhance the Adversarial Robustness of Visual Language Models

Bibliographic Details
Main Authors: Chao Li, Yonghao Liao, Caichang Ding, Zhiwei Ye
Format: Article
Language: English
Published: MDPI AG 2025-01-01
Series: Sensors
Subjects: multi-modal; adversarial robustness; visual language models; prompt tuning
Online Access: https://www.mdpi.com/1424-8220/25/1/258
author Chao Li
Yonghao Liao
Caichang Ding
Zhiwei Ye
collection DOAJ
description Large visual language models such as Contrastive Language-Image Pre-training (CLIP), despite their excellent performance, are highly vulnerable to adversarial examples. This work investigates the accuracy and robustness of visual language models (VLMs) from a novel multi-modal perspective. We propose a multi-modal fine-tuning method called Multi-modal Depth Adversarial Prompt Tuning (MDAPT), which guides the generation of visual prompts through text prompts to improve the accuracy and robustness of visual language models. Extensive experiments show significant improvements on three datasets (ϵ = 4/255): compared with traditional manually designed prompts, accuracy and robustness increase by an average of 17.84% and 10.85%, respectively. Moreover, the method retains strong gains under different attack methods: with our efficient settings, average accuracy and robustness improve by 32.16% and 21.00%, respectively, under three different attacks.
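For context on the abstract's evaluation setting: ϵ = 4/255 denotes the ℓ∞ perturbation budget commonly used with iterative attacks such as projected gradient descent (PGD). The sketch below illustrates that constraint on a toy logistic model, not the paper's actual CLIP pipeline; all names, the step size, and the iteration count are illustrative assumptions.

```python
import numpy as np

EPS = 4 / 255    # l-infinity budget quoted in the abstract
ALPHA = 1 / 255  # illustrative per-step size
STEPS = 10       # illustrative iteration count

def pgd_linf(x, y, w, b):
    """Untargeted l-infinity PGD on a toy logistic model sigma(w.x + b).

    x: clean input in [0, 1]; y: label in {0, 1}.
    Returns x_adv with ||x_adv - x||_inf <= EPS and x_adv in [0, 1].
    """
    x_adv = x.copy()
    for _ in range(STEPS):
        # gradient of the logistic loss with respect to the input
        p = 1.0 / (1.0 + np.exp(-(w @ x_adv + b)))
        grad = (p - y) * w
        # ascend the loss, then project back into the epsilon-ball and [0, 1]
        x_adv = x_adv + ALPHA * np.sign(grad)
        x_adv = np.clip(x_adv, x - EPS, x + EPS)
        x_adv = np.clip(x_adv, 0.0, 1.0)
    return x_adv
```

The two `np.clip` calls are the projection step: the first enforces the ϵ = 4/255 ball around the clean input, the second keeps the result a valid image in [0, 1].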
format Article
id doaj-art-9ea7d887a26940b5bd4a92a4543c8992
institution Kabale University
issn 1424-8220
language English
publishDate 2025-01-01
publisher MDPI AG
record_format Article
series Sensors
spelling doaj-art-9ea7d887a26940b5bd4a92a4543c8992 (indexed 2025-01-10T13:21:23Z)
doi 10.3390/s25010258
citation Sensors, vol. 25, no. 1, art. 258 (2025-01-01)
affiliation Chao Li: School of Computer Science, Hubei University of Technology, Wuhan 430068, China
affiliation Yonghao Liao: School of Computer Science, Hubei University of Technology, Wuhan 430068, China
affiliation Caichang Ding: School of Computer and Information Science, Hubei Engineering University, Xiaogan 432000, China
affiliation Zhiwei Ye: School of Computer Science, Hubei University of Technology, Wuhan 430068, China
title MDAPT: Multi-Modal Depth Adversarial Prompt Tuning to Enhance the Adversarial Robustness of Visual Language Models
topic multi-modal
adversarial robustness
visual language models
prompt tuning
url https://www.mdpi.com/1424-8220/25/1/258