Large Language Model and Traditional Machine Learning Scoring of Evolutionary Explanations: Benefits and Drawbacks
Few studies have compared Large Language Models (LLMs) to traditional Machine Learning (ML)-based automated scoring methods in terms of accuracy, ethics, and economics. Using a corpus of 1000 expert-scored and interview-validated scientific explanations derived from the ACORNS instrument, this study...
| Main Authors: | Yunlong Pan, Ross H. Nehm |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | MDPI AG, 2025-05-01 |
| Series: | Education Sciences |
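| Subjects: | large language models (LLMs); machine learning (ML); automated scoring; science assessment; accuracy metrics; ethical implications |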
| Online Access: | https://www.mdpi.com/2227-7102/15/6/676 |
| _version_ | 1849433152397770752 |
|---|---|
| author | Yunlong Pan; Ross H. Nehm |
| collection | DOAJ |
| description | Few studies have compared Large Language Models (LLMs) to traditional Machine Learning (ML)-based automated scoring methods in terms of accuracy, ethics, and economics. Using a corpus of 1000 expert-scored and interview-validated scientific explanations derived from the ACORNS instrument, this study employed three LLMs and the ML-based scoring engine EvoGrader. We measured scoring reliability (percentage agreement, kappa, precision, recall, F1) and processing time, and explored contextual factors such as ethics and cost. Results showed that, with very basic prompt engineering, ChatGPT-4o achieved the highest performance among the LLMs. Proprietary LLMs outperformed open-weight LLMs for most concepts. GPT-4o achieved robust but less accurate scoring than EvoGrader (~500 additional scoring errors). Ethical concerns over data ownership, reliability, and replicability over time were LLM limitations. EvoGrader offered superior accuracy, reliability, and replicability, but its development required a large, high-quality, human-scored corpus, domain expertise, and restricted assessment items. These findings highlight the diversity of considerations that should be weighed when evaluating LLM and ML scoring in science education. Despite impressive LLM advances, ML approaches may remain valuable in some contexts, particularly those prioritizing precision, reliability, replicability, privacy, and controlled implementation. |
| format | Article |
| id | doaj-art-2efdb0fb1ebd40feb566fde48f31837f |
| institution | Kabale University |
| issn | 2227-7102 |
| language | English |
| publishDate | 2025-05-01 |
| publisher | MDPI AG |
| record_format | Article |
| series | Education Sciences |
| spelling | doaj-art-2efdb0fb1ebd40feb566fde48f31837f; 2025-08-20T03:27:10Z; eng; MDPI AG; Education Sciences; 2227-7102; 2025-05-01; 15; 6; 676; 10.3390/educsci15060676; Large Language Model and Traditional Machine Learning Scoring of Evolutionary Explanations: Benefits and Drawbacks; Yunlong Pan (0); Ross H. Nehm (1); Department of Applied Mathematics and Statistics, College of Engineering, Stony Brook University, Stony Brook, NY 11794, USA; Department of Applied Mathematics and Statistics, College of Engineering, Stony Brook University, Stony Brook, NY 11794, USA; https://www.mdpi.com/2227-7102/15/6/676; large language models (LLMs); machine learning (ML); automated scoring; science assessment; accuracy metrics; ethical implications |
| title | Large Language Model and Traditional Machine Learning Scoring of Evolutionary Explanations: Benefits and Drawbacks |
| topic | large language models (LLMs); machine learning (ML); automated scoring; science assessment; accuracy metrics; ethical implications |
| url | https://www.mdpi.com/2227-7102/15/6/676 |
| work_keys_str_mv | AT yunlongpan largelanguagemodelandtraditionalmachinelearningscoringofevolutionaryexplanationsbenefitsanddrawbacks AT rosshnehm largelanguagemodelandtraditionalmachinelearningscoringofevolutionaryexplanationsbenefitsanddrawbacks |
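
The description above lists the scoring reliability metrics used in the study (percentage agreement, kappa, precision, recall, F1). As a rough illustration of how such metrics relate to one another, here is a minimal Python sketch that assumes binary scores (1 = concept present, 0 = absent) from a human expert and an automated scorer for a single concept. The function name `agreement_metrics` and the sample scores are hypothetical and are not taken from the article; the article's actual scoring pipeline may differ.

```python
from typing import Dict, Sequence


def agreement_metrics(human: Sequence[int], machine: Sequence[int]) -> Dict[str, float]:
    """Compare binary human and machine scores for one concept.

    Hypothetical helper for illustration only; not the article's code.
    """
    if len(human) != len(machine) or len(human) == 0:
        raise ValueError("score lists must be non-empty and of equal length")

    n = len(human)
    pairs = list(zip(human, machine))
    tp = sum(1 for h, m in pairs if h == 1 and m == 1)  # both say present
    tn = sum(1 for h, m in pairs if h == 0 and m == 0)  # both say absent
    fp = sum(1 for h, m in pairs if h == 0 and m == 1)  # machine over-scores
    fn = sum(1 for h, m in pairs if h == 1 and m == 0)  # machine under-scores

    # Observed agreement (percentage agreement expressed as a proportion).
    p_observed = (tp + tn) / n

    # Agreement expected by chance, from each rater's marginal rates.
    p_expected = ((tp + fn) / n) * ((tp + fp) / n) + ((tn + fp) / n) * ((tn + fn) / n)

    # Cohen's kappa; defined as 1.0 here when chance agreement is already perfect.
    kappa = (p_observed - p_expected) / (1 - p_expected) if p_expected != 1 else 1.0

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0

    return {
        "agreement": p_observed,
        "kappa": kappa,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


if __name__ == "__main__":
    # Made-up scores for eight responses, purely for demonstration.
    human_scores = [1, 1, 1, 0, 0, 0, 1, 0]
    machine_scores = [1, 1, 0, 0, 0, 1, 1, 0]
    for name, value in agreement_metrics(human_scores, machine_scores).items():
        print(f"{name}: {value:.2f}")
```

Kappa corrects raw agreement for the agreement expected by chance, which is why it can be much lower than percentage agreement when one score category dominates.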