A comparative analysis of encoder only and decoder only models for challenging LLM-generated STEM MCQs using a self-evaluation approach

Large Language Models (LLMs) have demonstrated impressive capabilities in various tasks, including Multiple-Choice Question Answering (MCQA) evaluated on benchmark datasets with few-shot prompting. Given the absence of benchmark Science, Technology, Engineering, and Mathematics (STEM) datasets of Multiple-Choice Questions (MCQs) created by LLMs, we employed various LLMs (e.g., Vicuna-13B, Bard, and GPT-3.5) to generate MCQs on STEM topics curated from Wikipedia. We evaluated open-source LLMs such as Llama 2-7B and Mistral-7B Instruct, along with an encoder model, DeBERTa v3 Large, on inference with added context as well as on fine-tuning with and without context. The results showed that DeBERTa v3 Large and Mistral-7B Instruct outperform Llama 2-7B, highlighting the potential of LLMs with fewer parameters to answer hard MCQs when given appropriate context through fine-tuning. We also benchmarked these models against closed-source models such as Gemini and GPT-4 on inference with context, showing the potential to narrow the gap between open-source and closed-source models when context is provided. Our work demonstrates the capability of LLMs to create more challenging tasks that can serve as self-evaluation for other models. It also contributes to understanding LLMs’ capabilities on STEM MCQ tasks and emphasizes the importance of context for LLMs with fewer parameters in enhancing their performance.

Bibliographic Details
Main Authors: Ghada Soliman, Ph.D., Hozaifa Zaki, Mohamed Kilany
Format: Article
Language: English
Published: Elsevier 2025-03-01
Series: Natural Language Processing Journal
Subjects: NLP; LLM; SLM; Self-evaluation; MCQ
Online Access: http://www.sciencedirect.com/science/article/pii/S294971912500007X
_version_ 1850059372511625216
author Ghada Soliman, Ph.D.
Hozaifa Zaki
Mohamed Kilany
author_facet Ghada Soliman, Ph.D.
Hozaifa Zaki
Mohamed Kilany
author_sort Ghada Soliman, Ph.D.
collection DOAJ
description Large Language Models (LLMs) have demonstrated impressive capabilities in various tasks, including Multiple-Choice Question Answering (MCQA) evaluated on benchmark datasets with few-shot prompting. Given the absence of benchmark Science, Technology, Engineering, and Mathematics (STEM) datasets of Multiple-Choice Questions (MCQs) created by LLMs, we employed various LLMs (e.g., Vicuna-13B, Bard, and GPT-3.5) to generate MCQs on STEM topics curated from Wikipedia. We evaluated open-source LLMs such as Llama 2-7B and Mistral-7B Instruct, along with an encoder model, DeBERTa v3 Large, on inference with added context as well as on fine-tuning with and without context. The results showed that DeBERTa v3 Large and Mistral-7B Instruct outperform Llama 2-7B, highlighting the potential of LLMs with fewer parameters to answer hard MCQs when given appropriate context through fine-tuning. We also benchmarked these models against closed-source models such as Gemini and GPT-4 on inference with context, showing the potential to narrow the gap between open-source and closed-source models when context is provided. Our work demonstrates the capability of LLMs to create more challenging tasks that can serve as self-evaluation for other models. It also contributes to understanding LLMs’ capabilities on STEM MCQ tasks and emphasizes the importance of context for LLMs with fewer parameters in enhancing their performance.
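The record does not include the paper's code, but the evaluation setup it describes (answering LLM-generated MCQs on inference, with or without a supporting passage) can be sketched briefly. Below is a minimal, hedged illustration in Python: the Hugging Face checkpoint name, the prompt template, and the rule of comparing next-token logits for the option letters are all assumptions for illustration, not the authors' actual pipeline.

```python
# Minimal sketch (not the paper's code): answering one LLM-generated STEM MCQ
# with and without context using an open-source decoder model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "mistralai/Mistral-7B-Instruct-v0.2"  # assumed checkpoint for Mistral-7B Instruct
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(
    MODEL, torch_dtype=torch.float16, device_map="auto"
)

def answer_mcq(question: str, options: list[str], context: str | None = None) -> str:
    """Return the option letter with the highest next-token logit after 'Answer:'."""
    header = f"Context: {context}\n\n" if context else ""
    prompt = (
        header
        + f"Question: {question}\n"
        + "".join(f"{letter}. {text}\n" for letter, text in zip("ABCD", options))
        + "Answer:"
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        next_token_logits = model(**inputs).logits[0, -1]
    # Restrict the comparison to the four candidate letters.
    letter_ids = [
        tokenizer.encode(f" {letter}", add_special_tokens=False)[-1]
        for letter in "ABCD"
    ]
    best = torch.stack([next_token_logits[i] for i in letter_ids]).argmax().item()
    return "ABCD"[best]

# Inference with context: a curated Wikipedia passage accompanies the MCQ.
print(answer_mcq(
    "Which particle mediates the electromagnetic force?",
    ["Gluon", "Photon", "W boson", "Higgs boson"],
    context="In quantum electrodynamics, the photon is the carrier of the "
            "electromagnetic interaction.",
))
```

The fine-tuning variants and the encoder-only baseline (DeBERTa v3 Large, e.g., with a multiple-choice classification head) would consume the same question-plus-context format; those details are not given in this record.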
format Article
id doaj-art-251d45023e214cdbbb5c3bd62b76ae62
institution DOAJ
issn 2949-7191
language English
publishDate 2025-03-01
publisher Elsevier
record_format Article
series Natural Language Processing Journal
spelling doaj-art-251d45023e214cdbbb5c3bd62b76ae62
  Indexed: 2025-08-20T02:50:55Z
  Language: eng
  Publisher: Elsevier
  Series: Natural Language Processing Journal
  ISSN: 2949-7191
  Published: 2025-03-01
  Volume: 10
  Article number: 100131
  DOI: 10.1016/j.nlp.2025.100131
  Title: A comparative analysis of encoder only and decoder only models for challenging LLM-generated STEM MCQs using a self-evaluation approach
  Authors: Ghada Soliman, Ph.D. (corresponding author); Hozaifa Zaki; Mohamed Kilany (all: Department of Artificial Intelligence, Orange Innovation Egypt, Cairo, Egypt)
  Abstract: as given in the description field above
  URL: http://www.sciencedirect.com/science/article/pii/S294971912500007X
  Keywords: NLP; LLM; SLM; Self-evaluation; MCQ
spellingShingle Ghada Soliman, Ph.D.
Hozaifa Zaki
Mohamed Kilany
A comparative analysis of encoder only and decoder only models for challenging LLM-generated STEM MCQs using a self-evaluation approach
Natural Language Processing Journal
NLP
LLM
SLM
Self-evaluation
MCQ
title A comparative analysis of encoder only and decoder only models for challenging LLM-generated STEM MCQs using a self-evaluation approach
title_full A comparative analysis of encoder only and decoder only models for challenging LLM-generated STEM MCQs using a self-evaluation approach
title_fullStr A comparative analysis of encoder only and decoder only models for challenging LLM-generated STEM MCQs using a self-evaluation approach
title_full_unstemmed A comparative analysis of encoder only and decoder only models for challenging LLM-generated STEM MCQs using a self-evaluation approach
title_short A comparative analysis of encoder only and decoder only models for challenging LLM-generated STEM MCQs using a self-evaluation approach
title_sort comparative analysis of encoder only and decoder only models for challenging llm generated stem mcqs using a self evaluation approach
topic NLP
LLM
SLM
Self-evaluation
MCQ
url http://www.sciencedirect.com/science/article/pii/S294971912500007X
work_keys_str_mv AT ghadasolimanphd acomparativeanalysisofencoderonlyanddecoderonlymodelsforchallengingllmgeneratedstemmcqsusingaselfevaluationapproach
AT hozaifazaki acomparativeanalysisofencoderonlyanddecoderonlymodelsforchallengingllmgeneratedstemmcqsusingaselfevaluationapproach
AT mohamedkilany acomparativeanalysisofencoderonlyanddecoderonlymodelsforchallengingllmgeneratedstemmcqsusingaselfevaluationapproach
AT ghadasolimanphd comparativeanalysisofencoderonlyanddecoderonlymodelsforchallengingllmgeneratedstemmcqsusingaselfevaluationapproach
AT hozaifazaki comparativeanalysisofencoderonlyanddecoderonlymodelsforchallengingllmgeneratedstemmcqsusingaselfevaluationapproach
AT mohamedkilany comparativeanalysisofencoderonlyanddecoderonlymodelsforchallengingllmgeneratedstemmcqsusingaselfevaluationapproach