A comparative analysis of encoder only and decoder only models for challenging LLM-generated STEM MCQs using a self-evaluation approach
Large Language Models (LLMs) have demonstrated impressive capabilities in various tasks, including Multiple-Choice Question Answering (MCQA) evaluated on benchmark datasets with few-shot prompting. Given the absence of benchmark Science, Technology, Engineering, and Mathematics (STEM) datasets of Multiple-Choice Questions (MCQs) created by LLMs, we employed various LLMs (e.g., Vicuna-13B, Bard, and GPT-3.5) to generate MCQs on STEM topics curated from Wikipedia. We evaluated open-source LLMs such as Llama 2-7B and Mistral-7B Instruct, along with an encoder model, DeBERTa v3 Large, on inference with added context as well as on fine-tuning with and without context. The results showed that DeBERTa v3 Large and Mistral-7B Instruct outperform Llama 2-7B, highlighting the potential of models with fewer parameters to answer hard MCQs when given appropriate context through fine-tuning. We also benchmarked these models against closed-source models such as Gemini and GPT-4 on inference with context, showing that the gap between open-source and closed-source models narrows when context is provided. Our work demonstrates the ability of LLMs to create more challenging tasks that can serve as self-evaluation for other models, contributes to understanding LLMs' capabilities on STEM MCQ tasks, and underscores the importance of context in improving the performance of models with fewer parameters.
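The encoder-side setup the abstract describes (scoring each answer option against the question, with context prepended) can be sketched briefly. This is a minimal illustration under assumptions, not the authors' code: it assumes the public `microsoft/deberta-v3-large` checkpoint with the Hugging Face `transformers` multiple-choice head, and the context, question, and options are hypothetical. Note the multiple-choice head's classifier weights are randomly initialized until fine-tuned, as the paper's fine-tuning setting implies.

```python
# Minimal sketch of encoder-style MCQA with context (an assumption about
# the setup, not the paper's code). Each (context + question, option) pair
# is encoded as one sequence; the multiple-choice head scores all options
# jointly and the highest logit wins.
import torch
from transformers import AutoModelForMultipleChoice, AutoTokenizer

MODEL = "microsoft/deberta-v3-large"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForMultipleChoice.from_pretrained(MODEL)  # MC head untrained until fine-tuned
model.eval()

# Hypothetical example; the paper's data is LLM-generated STEM MCQs.
context = "Oxidative phosphorylation takes place in the mitochondrion."
question = "Which organelle hosts oxidative phosphorylation?"
options = ["Nucleus", "Mitochondrion", "Ribosome", "Golgi apparatus"]

# One sequence per option: (context + question) paired with the option text.
enc = tokenizer(
    [f"{context} {question}"] * len(options),
    options,
    padding=True,
    truncation=True,
    return_tensors="pt",
)
# The multiple-choice head expects tensors of shape (batch, num_choices, seq_len).
batch = {k: v.unsqueeze(0) for k, v in enc.items()}

with torch.no_grad():
    logits = model(**batch).logits  # shape (1, num_choices)
print(options[logits.argmax(dim=-1).item()])
```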
| Main Authors: | Ghada Soliman, Ph.D.; Hozaifa Zaki; Mohamed Kilany |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | Elsevier, 2025-03-01 |
| Series: | Natural Language Processing Journal |
| Subjects: | NLP; LLM; SLM; Self-evaluation; MCQ |
| Online Access: | http://www.sciencedirect.com/science/article/pii/S294971912500007X |
| _version_ | 1850059372511625216 |
|---|---|
| author | Ghada Soliman, Ph.D.; Hozaifa Zaki; Mohamed Kilany |
| author_facet | Ghada Soliman, Ph.D.; Hozaifa Zaki; Mohamed Kilany |
| author_sort | Ghada Soliman, Ph.D. |
| collection | DOAJ |
| description | Large Language Models (LLMs) have demonstrated impressive capabilities in various tasks, including Multiple-Choice Question Answering (MCQA) evaluated on benchmark datasets with few-shot prompting. Given the absence of benchmark Science, Technology, Engineering, and Mathematics (STEM) datasets of Multiple-Choice Questions (MCQs) created by LLMs, we employed various LLMs (e.g., Vicuna-13B, Bard, and GPT-3.5) to generate MCQs on STEM topics curated from Wikipedia. We evaluated open-source LLMs such as Llama 2-7B and Mistral-7B Instruct, along with an encoder model, DeBERTa v3 Large, on inference with added context as well as on fine-tuning with and without context. The results showed that DeBERTa v3 Large and Mistral-7B Instruct outperform Llama 2-7B, highlighting the potential of models with fewer parameters to answer hard MCQs when given appropriate context through fine-tuning. We also benchmarked these models against closed-source models such as Gemini and GPT-4 on inference with context, showing that the gap between open-source and closed-source models narrows when context is provided. Our work demonstrates the ability of LLMs to create more challenging tasks that can serve as self-evaluation for other models, contributes to understanding LLMs' capabilities on STEM MCQ tasks, and underscores the importance of context in improving the performance of models with fewer parameters. (A decoder-side prompting sketch in this style follows the field table below.) |
| format | Article |
| id | doaj-art-251d45023e214cdbbb5c3bd62b76ae62 |
| institution | DOAJ |
| issn | 2949-7191 |
| language | English |
| publishDate | 2025-03-01 |
| publisher | Elsevier |
| record_format | Article |
| series | Natural Language Processing Journal |
| spelling | doaj-art-251d45023e214cdbbb5c3bd62b76ae62; 2025-08-20T02:50:55Z; eng; Elsevier; Natural Language Processing Journal; ISSN 2949-7191; 2025-03-01; vol. 10; art. 100131; DOI 10.1016/j.nlp.2025.100131; A comparative analysis of encoder only and decoder only models for challenging LLM-generated STEM MCQs using a self-evaluation approach; Ghada Soliman, Ph.D. (corresponding author), Hozaifa Zaki, Mohamed Kilany, all with the Department of Artificial Intelligence, Orange Innovation Egypt, Cairo, Egypt; http://www.sciencedirect.com/science/article/pii/S294971912500007X; keywords: NLP, LLM, SLM, Self-evaluation, MCQ |
| spellingShingle | Ghada Soliman, Ph.D.; Hozaifa Zaki; Mohamed Kilany; A comparative analysis of encoder only and decoder only models for challenging LLM-generated STEM MCQs using a self-evaluation approach; Natural Language Processing Journal; NLP; LLM; SLM; Self-evaluation; MCQ |
| title | A comparative analysis of encoder only and decoder only models for challenging LLM-generated STEM MCQs using a self-evaluation approach |
| title_full | A comparative analysis of encoder only and decoder only models for challenging LLM-generated STEM MCQs using a self-evaluation approach |
| title_fullStr | A comparative analysis of encoder only and decoder only models for challenging LLM-generated STEM MCQs using a self-evaluation approach |
| title_full_unstemmed | A comparative analysis of encoder only and decoder only models for challenging LLM-generated STEM MCQs using a self-evaluation approach |
| title_short | A comparative analysis of encoder only and decoder only models for challenging LLM-generated STEM MCQs using a self-evaluation approach |
| title_sort | comparative analysis of encoder only and decoder only models for challenging llm generated stem mcqs using a self evaluation approach |
| topic | NLP; LLM; SLM; Self-evaluation; MCQ |
| url | http://www.sciencedirect.com/science/article/pii/S294971912500007X |
| work_keys_str_mv | AT ghadasolimanphd acomparativeanalysisofencoderonlyanddecoderonlymodelsforchallengingllmgeneratedstemmcqsusingaselfevaluationapproach AT hozaifazaki acomparativeanalysisofencoderonlyanddecoderonlymodelsforchallengingllmgeneratedstemmcqsusingaselfevaluationapproach AT mohamedkilany acomparativeanalysisofencoderonlyanddecoderonlymodelsforchallengingllmgeneratedstemmcqsusingaselfevaluationapproach AT ghadasolimanphd comparativeanalysisofencoderonlyanddecoderonlymodelsforchallengingllmgeneratedstemmcqsusingaselfevaluationapproach AT hozaifazaki comparativeanalysisofencoderonlyanddecoderonlymodelsforchallengingllmgeneratedstemmcqsusingaselfevaluationapproach AT mohamedkilany comparativeanalysisofencoderonlyanddecoderonlymodelsforchallengingllmgeneratedstemmcqsusingaselfevaluationapproach |
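For the decoder-only models named in this record (e.g., Mistral-7B Instruct), inference with context is typically done by prompting rather than with a classification head. The sketch below is an assumption about that style, not the paper's prompt: it uses the public `mistralai/Mistral-7B-Instruct-v0.2` checkpoint via the Hugging Face `text-generation` pipeline, and the template, example, and generation settings are hypothetical.

```python
# Sketch of decoder-style MCQA by zero-shot prompting with context
# (prompt template and example are hypothetical, not from the paper).
from transformers import pipeline

generator = pipeline("text-generation", model="mistralai/Mistral-7B-Instruct-v0.2")

context = "Oxidative phosphorylation takes place in the mitochondrion."
question = "Which organelle hosts oxidative phosphorylation?"
options = ["A) Nucleus", "B) Mitochondrion", "C) Ribosome", "D) Golgi apparatus"]

# Mistral-Instruct wraps the user turn in [INST] ... [/INST].
prompt = (
    "[INST] Use the context to answer the question with the letter "
    "of the correct option only.\n"
    f"Context: {context}\nQuestion: {question}\n" + "\n".join(options) + " [/INST]"
)
out = generator(prompt, max_new_tokens=4, do_sample=False, return_full_text=False)
print(out[0]["generated_text"].strip())  # expected: "B"
```

Greedy decoding (`do_sample=False`) and a tiny `max_new_tokens` budget are one reasonable way to force a single-letter answer; the paper may parse model output differently.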