Advancing Multimodal Large Language Models: Optimizing Prompt Engineering Strategies for Enhanced Performance

This study investigates prompt engineering (PE) strategies to mitigate hallucination, a key limitation of multimodal large language models (MLLMs). To address this issue, we explore five prominent multimodal PE techniques: in-context learning (ICL), chain of thought (CoT), step-by-step reasoning (SSR), tree of thought (ToT), and retrieval-augmented generation (RAG). These techniques are systematically applied across multiple datasets with distinct domains and characteristics. Based on the empirical findings, we propose the greedy prompt engineering strategy (Greedy PES), a methodology for optimizing PE application across different datasets and MLLMs. To evaluate user satisfaction with MLLM-generated responses, we adopt a comprehensive set of evaluation metrics, including BLEU, ROUGE, METEOR, S-BERT, MoverScore, and CIDEr, and introduce a weighted aggregate evaluation score to provide a holistic assessment of model performance under varying conditions. Experimental results demonstrate that the optimal prompt engineering strategy varies significantly with both dataset properties and the MLLM used: datasets categorized as general benefit most from ICL, ToT, and RAG; mathematical datasets perform best with ICL, SSR, and ToT; and in scientific reasoning tasks, RAG and SSR emerge as the most effective strategies. Applying Greedy PES yields substantial improvements across multimodal tasks, with average evaluation score gains of 184.3% for general image captioning, 90.3% for mathematical visual question answering (VQA), and 49.1% for science VQA over conventional approaches. These findings highlight the effectiveness of structured PE strategies in optimizing MLLM performance and provide a robust framework for PE-driven model enhancement across diverse multimodal applications.
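The abstract specifies neither the weights in the aggregate score nor the exact greedy procedure, so the following is a minimal Python sketch of one plausible reading: a weighted sum of normalized metric scores, and a greedy loop that keeps stacking the single most helpful PE technique until none improves the score. All weights, names (`aggregate_score`, `greedy_pes`, `fake_evaluate`), and stub numbers are illustrative assumptions, not the authors' implementation.

```python
from typing import Callable, Dict, List, Tuple

# Assumed per-metric weights (summing to 1); the paper's actual weighting is
# not given in the abstract.
METRIC_WEIGHTS: Dict[str, float] = {
    "BLEU": 0.15,
    "ROUGE": 0.15,
    "METEOR": 0.15,
    "S-BERT": 0.20,
    "MoverScore": 0.20,
    "CIDEr": 0.15,
}

def aggregate_score(metric_scores: Dict[str, float]) -> float:
    """Weighted aggregate of metric scores, each assumed normalized to [0, 1]."""
    return sum(w * metric_scores[m] for m, w in METRIC_WEIGHTS.items())

# The five PE techniques explored in the study.
PE_TECHNIQUES = ["ICL", "CoT", "SSR", "ToT", "RAG"]

def greedy_pes(evaluate: Callable[[List[str]], float]) -> Tuple[List[str], float]:
    """Greedily stack PE techniques while the aggregate score keeps improving.

    `evaluate(combo)` is assumed to prompt the MLLM on a validation split with
    the given techniques applied and return the weighted aggregate score.
    """
    chosen: List[str] = []
    best = evaluate(chosen)            # baseline: plain prompt, no PE technique
    remaining = set(PE_TECHNIQUES)
    while remaining:
        trials = {t: evaluate(chosen + [t]) for t in remaining}
        winner = max(trials, key=trials.get)
        if trials[winner] <= best:     # no unused technique helps any more
            break
        chosen.append(winner)
        best = trials[winner]
        remaining.remove(winner)
    return chosen, best

if __name__ == "__main__":
    # Stub evaluator with made-up per-technique gains, standing in for real
    # MLLM runs scored by aggregate_score(); numbers are purely illustrative.
    GAIN = {"ICL": 0.06, "CoT": 0.01, "SSR": -0.02, "ToT": 0.04, "RAG": 0.05}

    def fake_evaluate(combo: List[str]) -> float:
        return 0.40 + sum(GAIN[t] for t in combo)

    combo, score = greedy_pes(fake_evaluate)
    print(combo, round(score, 2))      # ['ICL', 'RAG', 'ToT', 'CoT'] 0.56
```

With this stub the loop settles on ICL, RAG, ToT, and CoT; swapping in a different stub (e.g., one favoring SSR) changes the selected stack, mirroring the abstract's finding that the best PE combination depends on the dataset and the MLLM.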

Bibliographic Details
Main Authors: Minjun Son (Department of Metabiohealth, Sungkyunkwan University, Suwon 16419, Republic of Korea); Sungjin Lee (Department of Smart Automotive, Soonchunhyang University, Asan 31538, Republic of Korea)
Format: Article
Language: English
Published: MDPI AG, 2025-04-01
Series: Applied Sciences, Vol. 15, Iss. 7, Article 3992
ISSN: 2076-3417
DOI: 10.3390/app15073992
Subjects: multimodal large language model; prompt engineering; in-context learning; chain of thought; retrieval-augmented generation; step-by-step reasoning
Online Access: https://www.mdpi.com/2076-3417/15/7/3992