Large Language Models as Evaluators in Education: Verification of Feedback Consistency and Accuracy

The recent advancements in large language models (LLMs) have brought significant changes to the field of education, particularly in the generation and evaluation of feedback. LLMs are transforming education by streamlining tasks like content creation, feedback generation, and assessment, reducing teachers’ workload and improving online education efficiency. This study aimed to verify the consistency and reliability of LLMs as evaluators by conducting automated evaluations using various LLMs based on five educational evaluation criteria. The analysis revealed that while LLMs were capable of performing consistent evaluations under certain conditions, a lack of consistency was observed both among evaluators and across models for other criteria. Notably, low agreement among human evaluators correlated with reduced reliability in LLM evaluations. Furthermore, variations in evaluation results were influenced by factors such as prompt strategies and model architecture, highlighting the complexity of achieving reliable assessments using LLMs. These findings suggest that while LLMs have the potential to transform educational systems, careful selection and combination of models are essential to improve their consistency and align their performance with human evaluators in educational settings.

Bibliographic Details
Main Authors: Hyein Seo, Taewook Hwang, Jeesu Jung, Hyeonseok Kang, Hyuk Namgoong, Yohan Lee, Sangkeun Jung
Format: Article
Language: English
Published: MDPI AG, 2025-01-01
Series: Applied Sciences
Subjects: education; LLMs-as-evaluators; LLMs-as-judges; feedback generation; feedback evaluation; large language models
Online Access: https://www.mdpi.com/2076-3417/15/2/671
_version_ 1832589297088200704
author Hyein Seo
Taewook Hwang
Jeesu Jung
Hyeonseok Kang
Hyuk Namgoong
Yohan Lee
Sangkeun Jung
author_facet Hyein Seo
Taewook Hwang
Jeesu Jung
Hyeonseok Kang
Hyuk Namgoong
Yohan Lee
Sangkeun Jung
author_sort Hyein Seo
collection DOAJ
description The recent advancements in large language models (LLMs) have brought significant changes to the field of education, particularly in the generation and evaluation of feedback. LLMs are transforming education by streamlining tasks like content creation, feedback generation, and assessment, reducing teachers’ workload and improving online education efficiency. This study aimed to verify the consistency and reliability of LLMs as evaluators by conducting automated evaluations using various LLMs based on five educational evaluation criteria. The analysis revealed that while LLMs were capable of performing consistent evaluations under certain conditions, a lack of consistency was observed both among evaluators and across models for other criteria. Notably, low agreement among human evaluators correlated with reduced reliability in LLM evaluations. Furthermore, variations in evaluation results were influenced by factors such as prompt strategies and model architecture, highlighting the complexity of achieving reliable assessments using LLMs. These findings suggest that while LLMs have the potential to transform educational systems, careful selection and combination of models are essential to improve their consistency and align their performance with human evaluators in educational settings.
format Article
id doaj-art-e314fae0aae34837b5a7b0a0ec9ca523
institution Kabale University
issn 2076-3417
language English
publishDate 2025-01-01
publisher MDPI AG
record_format Article
series Applied Sciences
spelling Hyein Seo, Taewook Hwang, Jeesu Jung, Hyeonseok Kang, and Hyuk Namgoong (Computer Science and Engineering, Chungnam National University, Daejeon 34134, Republic of Korea); Yohan Lee (Electronics and Telecommunications Research Institute, Daejeon 34129, Republic of Korea); Sangkeun Jung (Computer Science and Engineering, Chungnam National University, Daejeon 34134, Republic of Korea). Large Language Models as Evaluators in Education: Verification of Feedback Consistency and Accuracy. Applied Sciences (MDPI AG), ISSN 2076-3417, 2025-01-01, vol. 15, no. 2, article 671. DOI: 10.3390/app15020671. Record doaj-art-e314fae0aae34837b5a7b0a0ec9ca523, indexed 2025-01-24T13:20:23Z, English. https://www.mdpi.com/2076-3417/15/2/671. Keywords: education; LLMs-as-evaluators; LLMs-as-judges; feedback generation; feedback evaluation; large language models.
spellingShingle Hyein Seo
Taewook Hwang
Jeesu Jung
Hyeonseok Kang
Hyuk Namgoong
Yohan Lee
Sangkeun Jung
Large Language Models as Evaluators in Education: Verification of Feedback Consistency and Accuracy
Applied Sciences
education
LLMs-as-evaluators
LLMs-as-judges
feedback generation
feedback evaluation
large language models
title Large Language Models as Evaluators in Education: Verification of Feedback Consistency and Accuracy
title_full Large Language Models as Evaluators in Education: Verification of Feedback Consistency and Accuracy
title_fullStr Large Language Models as Evaluators in Education: Verification of Feedback Consistency and Accuracy
title_full_unstemmed Large Language Models as Evaluators in Education: Verification of Feedback Consistency and Accuracy
title_short Large Language Models as Evaluators in Education: Verification of Feedback Consistency and Accuracy
title_sort large language models as evaluators in education verification of feedback consistency and accuracy
topic education
LLMs-as-evaluators
LLMs-as-judges
feedback generation
feedback evaluation
large language models
url https://www.mdpi.com/2076-3417/15/2/671
work_keys_str_mv AT hyeinseo largelanguagemodelsasevaluatorsineducationverificationoffeedbackconsistencyandaccuracy
AT taewookhwang largelanguagemodelsasevaluatorsineducationverificationoffeedbackconsistencyandaccuracy
AT jeesujung largelanguagemodelsasevaluatorsineducationverificationoffeedbackconsistencyandaccuracy
AT hyeonseokkang largelanguagemodelsasevaluatorsineducationverificationoffeedbackconsistencyandaccuracy
AT hyuknamgoong largelanguagemodelsasevaluatorsineducationverificationoffeedbackconsistencyandaccuracy
AT yohanlee largelanguagemodelsasevaluatorsineducationverificationoffeedbackconsistencyandaccuracy
AT sangkeunjung largelanguagemodelsasevaluatorsineducationverificationoffeedbackconsistencyandaccuracy
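
As a rough illustration of the consistency analysis the abstract describes, the sketch below computes per-criterion agreement between two evaluators (human or LLM judges) using Cohen's kappa. It is not taken from the article: the rubric names, the ratings, and the two-judge setup are hypothetical placeholders, and the study itself may use different criteria, evaluator pools, and agreement statistics.

    # Minimal, self-contained sketch (Python): per-criterion agreement between two
    # evaluators via Cohen's kappa. All names and ratings here are hypothetical
    # illustrations, not data or code from the article.
    from collections import Counter

    def cohen_kappa(ratings_a, ratings_b):
        """Cohen's kappa for two raters labeling the same items."""
        assert len(ratings_a) == len(ratings_b) and ratings_a
        n = len(ratings_a)
        # Observed agreement: fraction of items both raters scored identically.
        p_o = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
        # Chance agreement from each rater's marginal label distribution.
        freq_a, freq_b = Counter(ratings_a), Counter(ratings_b)
        labels = set(ratings_a) | set(ratings_b)
        p_e = sum((freq_a[k] / n) * (freq_b[k] / n) for k in labels)
        return 1.0 if p_e == 1.0 else (p_o - p_e) / (1.0 - p_e)

    # Hypothetical 1-5 ratings from two judges over the same six feedback samples,
    # grouped by an assumed five-criterion rubric (criterion names are placeholders).
    CRITERIA = ["correctness", "clarity", "helpfulness", "specificity", "tone"]
    judge_a = {
        "correctness": [5, 4, 3, 4, 5, 2], "clarity": [4, 4, 3, 5, 4, 3],
        "helpfulness": [3, 4, 4, 4, 5, 2], "specificity": [2, 3, 3, 4, 4, 2],
        "tone": [5, 5, 4, 5, 5, 4],
    }
    judge_b = {
        "correctness": [5, 4, 3, 3, 5, 2], "clarity": [4, 3, 3, 5, 4, 2],
        "helpfulness": [3, 4, 5, 4, 4, 2], "specificity": [3, 3, 3, 4, 5, 3],
        "tone": [5, 5, 5, 5, 5, 4],
    }

    if __name__ == "__main__":
        for criterion in CRITERIA:
            kappa = cohen_kappa(judge_a[criterion], judge_b[criterion])
            print(f"{criterion:>12}: kappa = {kappa:.2f}")

Reported this way, a criterion with low kappa flags exactly the kind of evaluator disagreement that the abstract associates with reduced reliability of LLM-based evaluations.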