Large Language Model and Traditional Machine Learning Scoring of Evolutionary Explanations: Benefits and Drawbacks

Few studies have compared Large Language Models (LLMs) to traditional Machine Learning (ML)-based automated scoring methods in terms of accuracy, ethics, and economics. Using a corpus of 1000 expert-scored and interview-validated scientific explanations derived from the ACORNS instrument, this study...

Bibliographic Details
Main Authors: Yunlong Pan, Ross H. Nehm
Format: Article
Language: English
Published: MDPI AG 2025-05-01
Series: Education Sciences
Online Access: https://www.mdpi.com/2227-7102/15/6/676
collection DOAJ
description Few studies have compared Large Language Models (LLMs) to traditional Machine Learning (ML)-based automated scoring methods in terms of accuracy, ethics, and economics. Using a corpus of 1000 expert-scored and interview-validated scientific explanations derived from the ACORNS instrument, this study employed three LLMs and the ML-based scoring engine, EvoGrader. We measured scoring reliability (percentage agreement, kappa, precision, recall, F1) and processing time, and explored contextual factors such as ethics and cost. Results showed that with very basic prompt engineering, ChatGPT-4o achieved the highest performance across LLMs. Proprietary LLMs outperformed open-weight LLMs for most concepts. GPT-4o achieved robust but less accurate scoring than EvoGrader (~500 additional scoring errors). Ethical concerns over data ownership, reliability, and replicability over time were LLM limitations. EvoGrader offered superior accuracy, reliability, and replicability, but its development required a large, high-quality, human-scored corpus, domain expertise, and restricted assessment items. These findings highlight the diversity of considerations that should be weighed when choosing between LLM and ML scoring in science education. Despite impressive LLM advances, ML approaches may remain valuable in some contexts, particularly those prioritizing precision, reliability, replicability, privacy, and controlled implementation.
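The reliability measures named in the description (percentage agreement, kappa, precision, recall, F1) can be illustrated with a minimal Python sketch. It assumes each response is scored 0 or 1 for a single concept and treats the human score as the reference. The function name `agreement_metrics` and the sample scores are hypothetical and are not taken from the article, which may have computed these measures differently.

```python
def agreement_metrics(human, machine):
    """Compare two equal-length lists of 0/1 scores; human scores are the reference."""
    if len(human) != len(machine) or not human:
        raise ValueError("score lists must be non-empty and of equal length")
    n = len(human)
    pairs = list(zip(human, machine))
    tp = sum(1 for h, m in pairs if h == 1 and m == 1)  # both score the concept present
    tn = sum(1 for h, m in pairs if h == 0 and m == 0)  # both score the concept absent
    fp = sum(1 for h, m in pairs if h == 0 and m == 1)  # machine-only positives
    fn = sum(1 for h, m in pairs if h == 1 and m == 0)  # machine misses

    observed = (tp + tn) / n  # percentage agreement, as a proportion
    # Chance agreement expected from each rater's marginal rates (Cohen's kappa).
    expected = ((tp + fn) / n) * ((tp + fp) / n) + ((tn + fp) / n) * ((tn + fn) / n)
    # Assumption: report kappa as 1.0 in the degenerate case where both raters
    # assign every response to the same single class.
    kappa = (observed - expected) / (1 - expected) if expected != 1 else 1.0

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return {
        "agreement": observed,
        "kappa": kappa,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical scores for eight responses (illustration only).
human_scores = [1, 1, 0, 0, 1, 0, 1, 0]
machine_scores = [1, 0, 0, 0, 1, 1, 1, 0]
print(agreement_metrics(human_scores, machine_scores))
```

Kappa is reported alongside raw agreement because it discounts the agreement two raters would reach by chance given how often each assigns a positive score.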
id doaj-art-2efdb0fb1ebd40feb566fde48f31837f
institution Kabale University
issn 2227-7102
doi 10.3390/educsci15060676
volume 15
issue 6
article number 676
affiliation Yunlong Pan: Department of Applied Mathematics and Statistics, College of Engineering, Stony Brook University, Stony Brook, NY 11794, USA
Ross H. Nehm: Department of Applied Mathematics and Statistics, College of Engineering, Stony Brook University, Stony Brook, NY 11794, USA
topic large language models (LLMs)
machine learning (ML)
automated scoring
science assessment
accuracy metrics
ethical implications