MDAPT: Multi-Modal Depth Adversarial Prompt Tuning to Enhance the Adversarial Robustness of Visual Language Models
Large visual language models such as Contrastive Language-Image Pre-training (CLIP), despite their excellent performance, are highly vulnerable to adversarial examples. This work investigates the accuracy and robustness of visual language models (VLMs) from a novel multi-modal perspective. We propose a multi-modal fine-tuning method called Multi-modal Depth Adversarial Prompt Tuning (MDAPT), which guides the generation of visual prompts through text prompts to improve the accuracy and robustness of visual language models. Extensive experiments show significant performance improvements on three datasets (ε = 4/255): compared with traditional hand-designed prompts, accuracy and robustness increase by an average of 17.84% and 10.85%, respectively. Moreover, the method maintains strong gains under different attack methods: with our efficient settings, average accuracy and robustness improve by 32.16% and 21.00%, respectively, under three different attacks, compared with traditional manual prompts.
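The ε = 4/255 figure in the abstract refers to the standard ℓ∞ perturbation budget used when crafting adversarial examples. The following is a minimal, purely illustrative PGD-style sketch of such a budgeted attack, using a toy linear scorer with an analytic gradient; it is not the paper's MDAPT method, which tunes learnable prompts inside CLIP.

```python
# Illustrative l_inf PGD sketch for an eps = 4/255 budget. The "model" here
# is a toy linear scorer with a hand-written gradient; MDAPT's actual
# setting (CLIP with learnable multi-modal prompts) is not reproduced.

EPS = 4 / 255        # l_inf perturbation budget from the abstract
ALPHA = 1 / 255      # per-step size
STEPS = 10

def loss_grad(x, w):
    """Gradient of loss = sum(w_i * x_i) w.r.t. x (toy stand-in for a model)."""
    return w

def pgd_attack(x, w):
    """Return an adversarial copy of x inside an l_inf ball of radius EPS."""
    adv = list(x)
    for _ in range(STEPS):
        g = loss_grad(adv, w)
        # Ascend the loss along the gradient sign (FGSM-style step).
        adv = [a + ALPHA * (1 if gi >= 0 else -1) for a, gi in zip(adv, g)]
        # Project back into the l_inf ball around the clean input.
        adv = [min(max(a, xi - EPS), xi + EPS) for a, xi in zip(adv, x)]
        # Keep pixel values in the valid [0, 1] range.
        adv = [min(max(a, 0.0), 1.0) for a in adv]
    return adv

clean = [0.5, 0.2, 0.9]
weights = [1.0, -1.0, 1.0]
adv = pgd_attack(clean, weights)
# Every coordinate stays within EPS of the clean input.
assert all(abs(a - c) <= EPS + 1e-9 for a, c in zip(adv, clean))
```

With ten steps of size 1/255, the projection step is what keeps the final perturbation clipped to the 4/255 budget; adversarial training and robustness evaluation in this setting generate examples like these against the model being tuned.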
| Main Authors: | Chao Li, Yonghao Liao, Caichang Ding, Zhiwei Ye |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | MDPI AG, 2025-01-01 |
| Series: | Sensors |
| Subjects: | multi-modal; adversarial robustness; visual language models; prompt tuning |
| Online Access: | https://www.mdpi.com/1424-8220/25/1/258 |
| author | Chao Li; Yonghao Liao; Caichang Ding; Zhiwei Ye |
|---|---|
| affiliations | Chao Li, Yonghao Liao, Zhiwei Ye: School of Computer Science, Hubei University of Technology, Wuhan 430068, China; Caichang Ding: School of Computer and Information Science, Hubei Engineering University, Xiaogan 432000, China |
| title | MDAPT: Multi-Modal Depth Adversarial Prompt Tuning to Enhance the Adversarial Robustness of Visual Language Models |
| description | Large visual language models such as Contrastive Language-Image Pre-training (CLIP), despite their excellent performance, are highly vulnerable to adversarial examples. This work investigates the accuracy and robustness of visual language models (VLMs) from a novel multi-modal perspective. We propose a multi-modal fine-tuning method called Multi-modal Depth Adversarial Prompt Tuning (MDAPT), which guides the generation of visual prompts through text prompts to improve the accuracy and robustness of visual language models. Extensive experiments show significant performance improvements on three datasets (ε = 4/255): compared with traditional hand-designed prompts, accuracy and robustness increase by an average of 17.84% and 10.85%, respectively. Moreover, the method maintains strong gains under different attack methods: with our efficient settings, average accuracy and robustness improve by 32.16% and 21.00%, respectively, under three different attacks, compared with traditional manual prompts. |
| format | Article |
| collection | DOAJ |
| id | doaj-art-9ea7d887a26940b5bd4a92a4543c8992 |
| institution | Kabale University |
| issn | 1424-8220 |
| doi | 10.3390/s25010258 |
| language | English |
| publishDate | 2025-01-01 |
| publisher | MDPI AG |
| series | Sensors |
| citation | Sensors, 25(1), 258 |
| topic | multi-modal; adversarial robustness; visual language models; prompt tuning |
| url | https://www.mdpi.com/1424-8220/25/1/258 |