MDAPT: Multi-Modal Depth Adversarial Prompt Tuning to Enhance the Adversarial Robustness of Visual Language Models

Bibliographic Details
Main Authors: Chao Li, Yonghao Liao, Caichang Ding, Zhiwei Ye
Format: Article
Language: English
Published: MDPI AG 2025-01-01
Series: Sensors
Subjects: multi-modal; adversarial robustness; visual language models; prompt tuning
Online Access: https://www.mdpi.com/1424-8220/25/1/258
author Chao Li
Yonghao Liao
Caichang Ding
Zhiwei Ye
collection DOAJ
description Large visual language models such as Contrastive Language-Image Pre-training (CLIP), despite their excellent performance, are highly vulnerable to adversarial examples. This work investigates the accuracy and robustness of visual language models (VLMs) from a novel multi-modal perspective. We propose a multi-modal fine-tuning method called Multi-modal Depth Adversarial Prompt Tuning (MDAPT), which guides the generation of visual prompts through text prompts to improve the accuracy and robustness of visual language models. Extensive experiments show significant improvements on three datasets (ϵ = 4/255): compared with traditional manually designed prompts, accuracy and robustness increase by an average of 17.84% and 10.85%, respectively. Moreover, the method retains strong gains under different attack methods: with our efficient settings, average accuracy and robustness improve by 32.16% and 21.00%, respectively, under three different attacks.
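For context on the abstract's evaluation setting: ϵ = 4/255 denotes the ℓ∞ perturbation budget commonly used with iterative attacks such as projected gradient descent (PGD). The sketch below illustrates that constraint on a toy logistic model, not the paper's actual CLIP pipeline; all names, the step size, and the iteration count are illustrative assumptions.

```python
import numpy as np

EPS = 4 / 255    # l-infinity budget quoted in the abstract
ALPHA = 1 / 255  # illustrative per-step size
STEPS = 10       # illustrative iteration count

def pgd_linf(x, y, w, b):
    """Untargeted l-infinity PGD on a toy logistic model sigma(w.x + b).

    x: clean input in [0, 1]; y: label in {0, 1}.
    Returns x_adv with ||x_adv - x||_inf <= EPS and x_adv in [0, 1].
    """
    x_adv = x.copy()
    for _ in range(STEPS):
        # gradient of the logistic loss with respect to the input
        p = 1.0 / (1.0 + np.exp(-(w @ x_adv + b)))
        grad = (p - y) * w
        # ascend the loss, then project back into the epsilon-ball and [0, 1]
        x_adv = x_adv + ALPHA * np.sign(grad)
        x_adv = np.clip(x_adv, x - EPS, x + EPS)
        x_adv = np.clip(x_adv, 0.0, 1.0)
    return x_adv
```

The two `np.clip` calls are the projection step: the first enforces the ϵ = 4/255 ball around the clean input, the second keeps the result a valid image in [0, 1].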
format Article
id doaj-art-9ea7d887a26940b5bd4a92a4543c8992
institution Kabale University
issn 1424-8220
language English
publishDate 2025-01-01
publisher MDPI AG
record_format Article
series Sensors
spelling doaj-art-9ea7d887a26940b5bd4a92a4543c8992 (indexed 2025-01-10T13:21:23Z)
doi 10.3390/s25010258
citation Sensors, vol. 25, no. 1, art. 258 (2025-01-01)
affiliation Chao Li: School of Computer Science, Hubei University of Technology, Wuhan 430068, China
affiliation Yonghao Liao: School of Computer Science, Hubei University of Technology, Wuhan 430068, China
affiliation Caichang Ding: School of Computer and Information Science, Hubei Engineering University, Xiaogan 432000, China
affiliation Zhiwei Ye: School of Computer Science, Hubei University of Technology, Wuhan 430068, China
title MDAPT: Multi-Modal Depth Adversarial Prompt Tuning to Enhance the Adversarial Robustness of Visual Language Models
topic multi-modal
adversarial robustness
visual language models
prompt tuning
url https://www.mdpi.com/1424-8220/25/1/258