Large Language Models as Evaluators in Education: Verification of Feedback Consistency and Accuracy
The recent advancements in large language models (LLMs) have brought significant changes to the field of education, particularly in the generation and evaluation of feedback. LLMs are transforming education by streamlining tasks like content creation, feedback generation, and assessment, reducing teachers’ workload and improving online education efficiency.
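The reliability question at the heart of the abstract reduces to an agreement statistic: how often an LLM evaluator and a human rater assign the same score to the same piece of feedback, beyond what chance alone would produce. The sketch below illustrates one common such statistic, Cohen's kappa; the article does not specify which agreement measure it uses, and the 1-5 scale, the scores, and the function shown here are illustrative assumptions rather than the authors' data or code.

```python
from collections import Counter

def cohens_kappa(ratings_a, ratings_b):
    """Cohen's kappa for two raters scoring the same items with nominal labels."""
    assert len(ratings_a) == len(ratings_b) and len(ratings_a) > 0
    n = len(ratings_a)
    # Observed agreement: fraction of items both raters scored identically.
    p_observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Chance agreement: probability of a match if each rater's labels were drawn
    # independently from that rater's own label distribution.
    counts_a, counts_b = Counter(ratings_a), Counter(ratings_b)
    p_chance = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if p_chance == 1.0:  # both raters used a single identical label throughout
        return 1.0
    return (p_observed - p_chance) / (1.0 - p_chance)

# Hypothetical 1-5 scores from an LLM evaluator and a human rater on ten feedback samples.
llm_scores   = [4, 3, 5, 2, 4, 4, 3, 5, 2, 4]
human_scores = [4, 3, 4, 2, 4, 3, 3, 5, 2, 5]
print(f"Cohen's kappa = {cohens_kappa(llm_scores, human_scores):.2f}")  # ~0.59
```

A kappa near 1 indicates near-perfect agreement, while a value near 0 means the LLM judge matches the human no more often than chance, the failure mode the abstract associates with criteria on which human evaluators themselves disagree.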
Main Authors: | Hyein Seo, Taewook Hwang, Jeesu Jung, Hyeonseok Kang, Hyuk Namgoong, Yohan Lee, Sangkeun Jung |
---|---|
Format: | Article |
Language: | English |
Published: | MDPI AG, 2025-01-01 |
Series: | Applied Sciences |
Subjects: | education; LLMs-as-evaluators; LLMs-as-judges; feedback generation; feedback evaluation; large language models |
Online Access: | https://www.mdpi.com/2076-3417/15/2/671 |
_version_ | 1832589297088200704 |
author | Hyein Seo; Taewook Hwang; Jeesu Jung; Hyeonseok Kang; Hyuk Namgoong; Yohan Lee; Sangkeun Jung |
author_sort | Hyein Seo |
collection | DOAJ |
description | The recent advancements in large language models (LLMs) have brought significant changes to the field of education, particularly in the generation and evaluation of feedback. LLMs are transforming education by streamlining tasks like content creation, feedback generation, and assessment, reducing teachers’ workload and improving online education efficiency. This study aimed to verify the consistency and reliability of LLMs as evaluators by conducting automated evaluations using various LLMs based on five educational evaluation criteria. The analysis revealed that while LLMs were capable of performing consistent evaluations under certain conditions, a lack of consistency was observed both among evaluators and across models for other criteria. Notably, low agreement among human evaluators correlated with reduced reliability in LLM evaluations. Furthermore, variations in evaluation results were influenced by factors such as prompt strategies and model architecture, highlighting the complexity of achieving reliable assessments using LLMs. These findings suggest that while LLMs have the potential to transform educational systems, careful selection and combination of models are essential to improve their consistency and align their performance with human evaluators in educational settings. |
format | Article |
id | doaj-art-e314fae0aae34837b5a7b0a0ec9ca523 |
institution | Kabale University |
issn | 2076-3417 |
language | English |
publishDate | 2025-01-01 |
publisher | MDPI AG |
record_format | Article |
series | Applied Sciences |
spelling | doaj-art-e314fae0aae34837b5a7b0a0ec9ca523; indexed 2025-01-24T13:20:23Z; eng; MDPI AG; Applied Sciences; ISSN 2076-3417; 2025-01-01; Vol. 15, Iss. 2, p. 671; doi:10.3390/app15020671; Large Language Models as Evaluators in Education: Verification of Feedback Consistency and Accuracy; Hyein Seo, Taewook Hwang, Jeesu Jung, Hyeonseok Kang, Hyuk Namgoong, and Sangkeun Jung (Computer Science and Engineering, Chungnam National University, Daejeon 34134, Republic of Korea); Yohan Lee (Electronics and Telecommunications Research Institute, Daejeon 34129, Republic of Korea); https://www.mdpi.com/2076-3417/15/2/671 |
title | Large Language Models as Evaluators in Education: Verification of Feedback Consistency and Accuracy |
topic | education; LLMs-as-evaluators; LLMs-as-judges; feedback generation; feedback evaluation; large language models |
url | https://www.mdpi.com/2076-3417/15/2/671 |