A comparative analysis of encoder only and decoder only models for challenging LLM-generated STEM MCQs using a self-evaluation approach

Large Language Models (LLMs) have demonstrated impressive capabilities in various tasks, including Multiple-Choice Question Answering (MCQA) evaluated on benchmark datasets with few-shot prompting. Given the absence of benchmark Science, Technology, Engineering, and Mathematics (STEM) datasets of Multiple-Choice Questions (MCQs) created by LLMs, we employed various LLMs (e.g., Vicuna-13B, Bard, and GPT-3.5) to generate MCQs on STEM topics curated from Wikipedia. We evaluated open-source LLMs such as Llama 2-7B and Mistral-7B Instruct, along with an encoder model, DeBERTa v3 Large, on inference with added context as well as on fine-tuning with and without context. The results showed that DeBERTa v3 Large and Mistral-7B Instruct outperform Llama 2-7B, highlighting the potential of LLMs with fewer parameters to answer hard MCQs when given appropriate context through fine-tuning. We also benchmarked these models against closed-source models such as Gemini and GPT-4 on inference with context, showing the potential to narrow the gap between open-source and closed-source models when context is provided. Our work demonstrates the capability of LLMs to create more challenging tasks that can serve as self-evaluation for other models. It also contributes to understanding LLMs’ capabilities on STEM MCQ tasks and emphasizes the importance of context for LLMs with fewer parameters in enhancing their performance.

Bibliographic Details
Main Authors: Ghada Soliman, Ph.D., Hozaifa Zaki, Mohamed Kilany
Format: Article
Language: English
Published: Elsevier 2025-03-01
Series: Natural Language Processing Journal
Subjects: NLP; LLM; SLM; Self-evaluation; MCQ
Online Access: http://www.sciencedirect.com/science/article/pii/S294971912500007X
_version_ 1850059372511625216
author Ghada Soliman, Ph.D.
Hozaifa Zaki
Mohamed Kilany
author_facet Ghada Soliman, Ph.D.
Hozaifa Zaki
Mohamed Kilany
author_sort Ghada Soliman, Ph.D.
collection DOAJ
description Large Language Models (LLMs) have demonstrated impressive capabilities in various tasks, including Multiple-Choice Question Answering (MCQA) evaluated on benchmark datasets with few-shot prompting. Given the absence of benchmark Science, Technology, Engineering, and Mathematics (STEM) datasets of Multiple-Choice Questions (MCQs) created by LLMs, we employed various LLMs (e.g., Vicuna-13B, Bard, and GPT-3.5) to generate MCQs on STEM topics curated from Wikipedia. We evaluated open-source LLMs such as Llama 2-7B and Mistral-7B Instruct, along with an encoder model, DeBERTa v3 Large, on inference with added context as well as on fine-tuning with and without context. The results showed that DeBERTa v3 Large and Mistral-7B Instruct outperform Llama 2-7B, highlighting the potential of LLMs with fewer parameters to answer hard MCQs when given appropriate context through fine-tuning. We also benchmarked these models against closed-source models such as Gemini and GPT-4 on inference with context, showing the potential to narrow the gap between open-source and closed-source models when context is provided. Our work demonstrates the capability of LLMs to create more challenging tasks that can serve as self-evaluation for other models. It also contributes to understanding LLMs’ capabilities on STEM MCQ tasks and emphasizes the importance of context for LLMs with fewer parameters in enhancing their performance.
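The record does not include the paper's code, but the evaluation setup it describes (answering LLM-generated MCQs on inference, with or without a supporting passage) can be sketched briefly. Below is a minimal, hedged illustration in Python: the Hugging Face checkpoint name, the prompt template, and the rule of comparing next-token logits for the option letters are all assumptions for illustration, not the authors' actual pipeline.

```python
# Minimal sketch (not the paper's code): answering one LLM-generated STEM MCQ
# with and without context using an open-source decoder model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "mistralai/Mistral-7B-Instruct-v0.2"  # assumed checkpoint for Mistral-7B Instruct
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(
    MODEL, torch_dtype=torch.float16, device_map="auto"
)

def answer_mcq(question: str, options: list[str], context: str | None = None) -> str:
    """Return the option letter with the highest next-token logit after 'Answer:'."""
    header = f"Context: {context}\n\n" if context else ""
    prompt = (
        header
        + f"Question: {question}\n"
        + "".join(f"{letter}. {text}\n" for letter, text in zip("ABCD", options))
        + "Answer:"
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        next_token_logits = model(**inputs).logits[0, -1]
    # Restrict the comparison to the four candidate letters.
    letter_ids = [
        tokenizer.encode(f" {letter}", add_special_tokens=False)[-1]
        for letter in "ABCD"
    ]
    best = torch.stack([next_token_logits[i] for i in letter_ids]).argmax().item()
    return "ABCD"[best]

# Inference with context: a curated Wikipedia passage accompanies the MCQ.
print(answer_mcq(
    "Which particle mediates the electromagnetic force?",
    ["Gluon", "Photon", "W boson", "Higgs boson"],
    context="In quantum electrodynamics, the photon is the carrier of the "
            "electromagnetic interaction.",
))
```

The fine-tuning variants and the encoder-only baseline (DeBERTa v3 Large, e.g., with a multiple-choice classification head) would consume the same question-plus-context format; those details are not given in this record.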
format Article
id doaj-art-251d45023e214cdbbb5c3bd62b76ae62
institution DOAJ
issn 2949-7191
language English
publishDate 2025-03-01
publisher Elsevier
record_format Article
series Natural Language Processing Journal
spelling doaj-art-251d45023e214cdbbb5c3bd62b76ae62
  Indexed: 2025-08-20T02:50:55Z
  Language: eng
  Publisher: Elsevier
  Series: Natural Language Processing Journal
  ISSN: 2949-7191
  Published: 2025-03-01
  Volume: 10
  Article number: 100131
  DOI: 10.1016/j.nlp.2025.100131
  Title: A comparative analysis of encoder only and decoder only models for challenging LLM-generated STEM MCQs using a self-evaluation approach
  Authors: Ghada Soliman, Ph.D. (corresponding author); Hozaifa Zaki; Mohamed Kilany (all: Department of Artificial Intelligence, Orange Innovation Egypt, Cairo, Egypt)
  Abstract: as given in the description field above
  URL: http://www.sciencedirect.com/science/article/pii/S294971912500007X
  Keywords: NLP; LLM; SLM; Self-evaluation; MCQ
spellingShingle Ghada Soliman, Ph.D.
Hozaifa Zaki
Mohamed Kilany
A comparative analysis of encoder only and decoder only models for challenging LLM-generated STEM MCQs using a self-evaluation approach
Natural Language Processing Journal
NLP
LLM
SLM
Self-evaluation
MCQ
title A comparative analysis of encoder only and decoder only models for challenging LLM-generated STEM MCQs using a self-evaluation approach
title_full A comparative analysis of encoder only and decoder only models for challenging LLM-generated STEM MCQs using a self-evaluation approach
title_fullStr A comparative analysis of encoder only and decoder only models for challenging LLM-generated STEM MCQs using a self-evaluation approach
title_full_unstemmed A comparative analysis of encoder only and decoder only models for challenging LLM-generated STEM MCQs using a self-evaluation approach
title_short A comparative analysis of encoder only and decoder only models for challenging LLM-generated STEM MCQs using a self-evaluation approach
title_sort comparative analysis of encoder only and decoder only models for challenging llm generated stem mcqs using a self evaluation approach
topic NLP
LLM
SLM
Self-evaluation
MCQ
url http://www.sciencedirect.com/science/article/pii/S294971912500007X
work_keys_str_mv AT ghadasolimanphd acomparativeanalysisofencoderonlyanddecoderonlymodelsforchallengingllmgeneratedstemmcqsusingaselfevaluationapproach
AT hozaifazaki acomparativeanalysisofencoderonlyanddecoderonlymodelsforchallengingllmgeneratedstemmcqsusingaselfevaluationapproach
AT mohamedkilany acomparativeanalysisofencoderonlyanddecoderonlymodelsforchallengingllmgeneratedstemmcqsusingaselfevaluationapproach
AT ghadasolimanphd comparativeanalysisofencoderonlyanddecoderonlymodelsforchallengingllmgeneratedstemmcqsusingaselfevaluationapproach
AT hozaifazaki comparativeanalysisofencoderonlyanddecoderonlymodelsforchallengingllmgeneratedstemmcqsusingaselfevaluationapproach
AT mohamedkilany comparativeanalysisofencoderonlyanddecoderonlymodelsforchallengingllmgeneratedstemmcqsusingaselfevaluationapproach