Large Language Models as Evaluators in Education: Verification of Feedback Consistency and Accuracy

The recent advancements in large language models (LLMs) have brought significant changes to the field of education, particularly in the generation and evaluation of feedback. LLMs are transforming education by streamlining tasks like content creation, feedback generation, and assessment, reducing teachers’ workload and improving online education efficiency. This study aimed to verify the consistency and reliability of LLMs as evaluators by conducting automated evaluations using various LLMs based on five educational evaluation criteria. The analysis revealed that while LLMs were capable of performing consistent evaluations under certain conditions, a lack of consistency was observed both among evaluators and across models for other criteria. Notably, low agreement among human evaluators correlated with reduced reliability in LLM evaluations. Furthermore, variations in evaluation results were influenced by factors such as prompt strategies and model architecture, highlighting the complexity of achieving reliable assessments using LLMs. These findings suggest that while LLMs have the potential to transform educational systems, careful selection and combination of models are essential to improve their consistency and align their performance with human evaluators in educational settings.

Bibliographic Details
Main Authors: Hyein Seo, Taewook Hwang, Jeesu Jung, Hyeonseok Kang, Hyuk Namgoong, Yohan Lee, Sangkeun Jung
Format: Article
Language: English
Published: MDPI AG, 2025-01-01
Series: Applied Sciences
Subjects: education; LLMs-as-evaluators; LLMs-as-judges; feedback generation; feedback evaluation; large language models
Online Access: https://www.mdpi.com/2076-3417/15/2/671
_version_ 1832589297088200704
author Hyein Seo
Taewook Hwang
Jeesu Jung
Hyeonseok Kang
Hyuk Namgoong
Yohan Lee
Sangkeun Jung
author_facet Hyein Seo
Taewook Hwang
Jeesu Jung
Hyeonseok Kang
Hyuk Namgoong
Yohan Lee
Sangkeun Jung
author_sort Hyein Seo
collection DOAJ
description The recent advancements in large language models (LLMs) have brought significant changes to the field of education, particularly in the generation and evaluation of feedback. LLMs are transforming education by streamlining tasks like content creation, feedback generation, and assessment, reducing teachers’ workload and improving online education efficiency. This study aimed to verify the consistency and reliability of LLMs as evaluators by conducting automated evaluations using various LLMs based on five educational evaluation criteria. The analysis revealed that while LLMs were capable of performing consistent evaluations under certain conditions, a lack of consistency was observed both among evaluators and across models for other criteria. Notably, low agreement among human evaluators correlated with reduced reliability in LLM evaluations. Furthermore, variations in evaluation results were influenced by factors such as prompt strategies and model architecture, highlighting the complexity of achieving reliable assessments using LLMs. These findings suggest that while LLMs have the potential to transform educational systems, careful selection and combination of models are essential to improve their consistency and align their performance with human evaluators in educational settings.
format Article
id doaj-art-e314fae0aae34837b5a7b0a0ec9ca523
institution Kabale University
issn 2076-3417
language English
publishDate 2025-01-01
publisher MDPI AG
record_format Article
series Applied Sciences
spelling Hyein Seo, Taewook Hwang, Jeesu Jung, Hyeonseok Kang, and Hyuk Namgoong (Computer Science and Engineering, Chungnam National University, Daejeon 34134, Republic of Korea); Yohan Lee (Electronics and Telecommunications Research Institute, Daejeon 34129, Republic of Korea); Sangkeun Jung (Computer Science and Engineering, Chungnam National University, Daejeon 34134, Republic of Korea). Large Language Models as Evaluators in Education: Verification of Feedback Consistency and Accuracy. Applied Sciences (MDPI AG), ISSN 2076-3417, 2025-01-01, vol. 15, no. 2, article 671. DOI: 10.3390/app15020671. Record doaj-art-e314fae0aae34837b5a7b0a0ec9ca523, indexed 2025-01-24T13:20:23Z, English. https://www.mdpi.com/2076-3417/15/2/671. Keywords: education; LLMs-as-evaluators; LLMs-as-judges; feedback generation; feedback evaluation; large language models.
spellingShingle Hyein Seo
Taewook Hwang
Jeesu Jung
Hyeonseok Kang
Hyuk Namgoong
Yohan Lee
Sangkeun Jung
Large Language Models as Evaluators in Education: Verification of Feedback Consistency and Accuracy
Applied Sciences
education
LLMs-as-evaluators
LLMs-as-judges
feedback generation
feedback evaluation
large language models
title Large Language Models as Evaluators in Education: Verification of Feedback Consistency and Accuracy
title_full Large Language Models as Evaluators in Education: Verification of Feedback Consistency and Accuracy
title_fullStr Large Language Models as Evaluators in Education: Verification of Feedback Consistency and Accuracy
title_full_unstemmed Large Language Models as Evaluators in Education: Verification of Feedback Consistency and Accuracy
title_short Large Language Models as Evaluators in Education: Verification of Feedback Consistency and Accuracy
title_sort large language models as evaluators in education verification of feedback consistency and accuracy
topic education
LLMs-as-evaluators
LLMs-as-judges
feedback generation
feedback evaluation
large language models
url https://www.mdpi.com/2076-3417/15/2/671
work_keys_str_mv AT hyeinseo largelanguagemodelsasevaluatorsineducationverificationoffeedbackconsistencyandaccuracy
AT taewookhwang largelanguagemodelsasevaluatorsineducationverificationoffeedbackconsistencyandaccuracy
AT jeesujung largelanguagemodelsasevaluatorsineducationverificationoffeedbackconsistencyandaccuracy
AT hyeonseokkang largelanguagemodelsasevaluatorsineducationverificationoffeedbackconsistencyandaccuracy
AT hyuknamgoong largelanguagemodelsasevaluatorsineducationverificationoffeedbackconsistencyandaccuracy
AT yohanlee largelanguagemodelsasevaluatorsineducationverificationoffeedbackconsistencyandaccuracy
AT sangkeunjung largelanguagemodelsasevaluatorsineducationverificationoffeedbackconsistencyandaccuracy
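
As a rough illustration of the consistency analysis the abstract describes, the sketch below computes per-criterion agreement between two evaluators (human or LLM judges) using Cohen's kappa. It is not taken from the article: the rubric names, the ratings, and the two-judge setup are hypothetical placeholders, and the study itself may use different criteria, evaluator pools, and agreement statistics.

    # Minimal, self-contained sketch (Python): per-criterion agreement between two
    # evaluators via Cohen's kappa. All names and ratings here are hypothetical
    # illustrations, not data or code from the article.
    from collections import Counter

    def cohen_kappa(ratings_a, ratings_b):
        """Cohen's kappa for two raters labeling the same items."""
        assert len(ratings_a) == len(ratings_b) and ratings_a
        n = len(ratings_a)
        # Observed agreement: fraction of items both raters scored identically.
        p_o = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
        # Chance agreement from each rater's marginal label distribution.
        freq_a, freq_b = Counter(ratings_a), Counter(ratings_b)
        labels = set(ratings_a) | set(ratings_b)
        p_e = sum((freq_a[k] / n) * (freq_b[k] / n) for k in labels)
        return 1.0 if p_e == 1.0 else (p_o - p_e) / (1.0 - p_e)

    # Hypothetical 1-5 ratings from two judges over the same six feedback samples,
    # grouped by an assumed five-criterion rubric (criterion names are placeholders).
    CRITERIA = ["correctness", "clarity", "helpfulness", "specificity", "tone"]
    judge_a = {
        "correctness": [5, 4, 3, 4, 5, 2], "clarity": [4, 4, 3, 5, 4, 3],
        "helpfulness": [3, 4, 4, 4, 5, 2], "specificity": [2, 3, 3, 4, 4, 2],
        "tone": [5, 5, 4, 5, 5, 4],
    }
    judge_b = {
        "correctness": [5, 4, 3, 3, 5, 2], "clarity": [4, 3, 3, 5, 4, 2],
        "helpfulness": [3, 4, 5, 4, 4, 2], "specificity": [3, 3, 3, 4, 5, 3],
        "tone": [5, 5, 5, 5, 5, 4],
    }

    if __name__ == "__main__":
        for criterion in CRITERIA:
            kappa = cohen_kappa(judge_a[criterion], judge_b[criterion])
            print(f"{criterion:>12}: kappa = {kappa:.2f}")

Reported this way, a criterion with low kappa flags exactly the kind of evaluator disagreement that the abstract associates with reduced reliability of LLM-based evaluations.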