A comparative analysis of encoder only and decoder only models for challenging LLM-generated STEM MCQs using a self-evaluation approach

Large Language Models (LLMs) have demonstrated impressive capabilities in various tasks, including Multiple-Choice Question Answering (MCQA) evaluated on benchmark datasets with few-shot prompting. Given the absence of benchmark Science, Technology, Engineering, and Mathematics (STEM) datasets on Mu...

Full description

Saved in:

Bibliographic Details
Main Authors:	Ghada Soliman, Ph.D., Hozaifa Zaki, Mohamed Kilany
Format:	Article
Language:	English
Published:	Elsevier 2025-03-01
Series:	Natural Language Processing Journal
Subjects:	NLP LLM SLM Self-evaluation MCQ
Online Access:	http://www.sciencedirect.com/science/article/pii/S294971912500007X
Tags:	Add Tag No Tags, Be the first to tag this record!

Description
Summary:	Large Language Models (LLMs) have demonstrated impressive capabilities in various tasks, including Multiple-Choice Question Answering (MCQA) evaluated on benchmark datasets with few-shot prompting. Given the absence of benchmark Science, Technology, Engineering, and Mathematics (STEM) datasets on Multiple-Choice Questions (MCQs) created by LLMs, we employed various LLMs (e.g., Vicuna-13B, Bard, and GPT-3.5) to generate MCQs on STEM topics curated from Wikipedia. We evaluated open-source LLM models such as Llama 2-7B and Mistral-7B Instruct, along with an encoder model such as DeBERTa v3 Large, on inference by adding context in addition to fine-tuning with and without context. The results showed that DeBERTa v3 Large and Mistral-7B Instruct outperform Llama 2-7B, highlighting the potential of LLMs with fewer parameters in answering hard MCQs when given the appropriate context through fine-tuning. We also benchmarked the results of these models against closed-source models such as Gemini and GPT-4 on inference with context, showcasing the potential of narrowing the gap between open-source and closed-source models when context is provided. Our work demonstrates the capabilities of LLMs in creating more challenging tasks that can be used as self-evaluation for other models. It also contributes to understanding LLMs’ capabilities in STEM MCQs tasks and emphasizes the importance of context for LLMs with fewer parameters in enhancing their performance.
ISSN:	2949-7191

A comparative analysis of encoder only and decoder only models for challenging LLM-generated STEM MCQs using a self-evaluation approach

Similar Items