Advancing Multimodal Large Language Models: Optimizing Prompt Engineering Strategies for Enhanced Performance
This study investigates prompt engineering (PE) strategies to mitigate hallucination, a key limitation of multimodal large language models (MLLMs). To address this issue, we explore five prominent multimodal PE techniques: in-context learning (ICL), chain of thought (CoT), step-by-step reasoning (SSR), tree of thought (ToT), and retrieval-augmented generation (RAG). These techniques are systematically applied across multiple datasets with distinct domains and characteristics. Based on the empirical findings, we propose the greedy prompt engineering strategy (Greedy PES), a methodology for optimizing PE application across different datasets and MLLMs. To evaluate user satisfaction with MLLM-generated responses, we adopt a comprehensive set of evaluation metrics, including BLEU, ROUGE, METEOR, S-BERT, MoverScore, and CIDEr. A weighted aggregate evaluation score is introduced to provide a holistic assessment of model performance under varying conditions. Experimental results demonstrate that the optimal prompt engineering strategy varies significantly depending on both dataset properties and the MLLM used. Specifically, general datasets benefit the most from ICL, ToT, and RAG, whereas mathematical datasets perform optimally with ICL, SSR, and ToT. In scientific reasoning tasks, RAG and SSR emerge as the most effective strategies. Applying Greedy PES leads to a substantial improvement in performance across different multimodal tasks, achieving an average evaluation score enhancement of 184.3% for general image captioning, 90.3% for mathematical visual question answering (VQA), and 49.1% for science VQA compared to conventional approaches. These findings highlight the effectiveness of structured PE strategies in optimizing MLLM performance and provide a robust framework for PE-driven model enhancement across diverse multimodal applications.
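For illustration, here is a minimal Python sketch of the two mechanisms the abstract names: the weighted aggregate evaluation score over the six reported metrics, and a greedy selection loop in the spirit of Greedy PES. The equal metric weights, the `evaluate` callback, and the grow-one-technique-at-a-time formulation are assumptions made for the sketch; the record does not include the authors' actual implementation.

```python
from typing import Callable

# The five PE techniques explored in the paper.
TECHNIQUES = ["ICL", "CoT", "SSR", "ToT", "RAG"]

# Assumed equal weights over the six reported metrics; the paper introduces
# a weighted aggregate score, but the weights are not given in this record.
WEIGHTS = {"BLEU": 1 / 6, "ROUGE": 1 / 6, "METEOR": 1 / 6,
           "S-BERT": 1 / 6, "MoverScore": 1 / 6, "CIDEr": 1 / 6}


def aggregate_score(metric_scores: dict[str, float]) -> float:
    """Weighted sum of per-metric scores, assumed pre-normalized to [0, 1]."""
    return sum(WEIGHTS[name] * score for name, score in metric_scores.items())


def greedy_pes(evaluate: Callable[[list[str]], float],
               candidates: list[str] = TECHNIQUES) -> list[str]:
    """Greedily build a PE-technique combination for one dataset/MLLM pair.

    `evaluate` is a hypothetical callback that applies the given techniques,
    runs the MLLM on a validation split, and returns the aggregate score.
    Each step keeps the technique with the largest gain; the loop stops
    once no remaining technique improves the score.
    """
    chosen: list[str] = []
    best = evaluate(chosen)  # baseline: plain prompt, no PE technique
    while len(chosen) < len(candidates):
        remaining = [t for t in candidates if t not in chosen]
        trial = {t: evaluate(chosen + [t]) for t in remaining}
        winner = max(trial, key=trial.get)
        if trial[winner] <= best:
            break  # no remaining technique helps; stop
        chosen.append(winner)
        best = trial[winner]
    return chosen


if __name__ == "__main__":
    # Toy evaluate() that happens to favor ICL and RAG, for demonstration only.
    def toy(techniques: list[str]) -> float:
        return (0.2 + 0.3 * ("ICL" in techniques)
                + 0.25 * ("RAG" in techniques) - 0.01 * len(techniques))

    print(greedy_pes(toy))  # -> ['ICL', 'RAG']
```

Under this reading, the per-domain results quoted above (e.g., ICL, ToT, and RAG for general datasets) correspond to the combinations such a loop would settle on for each dataset/model pair.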
| Main Authors: | Minjun Son, Sungjin Lee |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | MDPI AG, 2025-04-01 |
| Series: | Applied Sciences |
| Subjects: | multimodal large language model; prompt engineering; in-context learning; chain of thought; retrieval-augmented generation; step-by-step reasoning |
| Online Access: | https://www.mdpi.com/2076-3417/15/7/3992 |
| ISSN: | 2076-3417 |
| DOI: | 10.3390/app15073992 |
| Author Affiliations: | Minjun Son: Department of Metabiohealth, Sungkyunkwan University, Suwon 16419, Republic of Korea; Sungjin Lee: Department of Smart Automotive, Soonchunhyang University, Asan 31538, Republic of Korea |