Large Language Model and Traditional Machine Learning Scoring of Evolutionary Explanations: Benefits and Drawbacks

Few studies have compared Large Language Models (LLMs) to traditional Machine Learning (ML)-based automated scoring methods in terms of accuracy, ethics, and economics. Using a corpus of 1000 expert-scored and interview-validated scientific explanations derived from the ACORNS instrument, this study...

Bibliographic Details
Main Authors:	Yunlong Pan, Ross H. Nehm
Format:	Article
Language:	English
Published:	MDPI AG 2025-05-01
Series:	Education Sciences
Subjects:	large language models (LLMs); machine learning (ML); automated scoring; science assessment; accuracy metrics; ethical implications
Online Access:	https://www.mdpi.com/2227-7102/15/6/676

_version_	1849433152397770752
author	Yunlong Pan Ross H. Nehm
collection	DOAJ
description	Few studies have compared Large Language Models (LLMs) to traditional Machine Learning (ML)-based automated scoring methods in terms of accuracy, ethics, and economics. Using a corpus of 1000 expert-scored and interview-validated scientific explanations derived from the ACORNS instrument, this study employed three LLMs and the ML-based scoring engine, EvoGrader. We measured scoring reliability (percentage agreement, kappa, precision, recall, F1) and processing time, and explored contextual factors such as ethics and cost. Results showed that with very basic prompt engineering, ChatGPT-4o achieved the highest performance among the LLMs. Proprietary LLMs outperformed open-weight LLMs for most concepts. GPT-4o's scoring was robust but less accurate than EvoGrader's (~500 additional scoring errors). LLM limitations included ethical concerns over data ownership, reliability, and replicability over time. EvoGrader offered superior accuracy, reliability, and replicability, but its development required a large, high-quality, human-scored corpus, domain expertise, and restricted assessment items. These findings highlight the diversity of factors that should be weighed when considering LLM and ML scoring in science education. Despite impressive LLM advances, ML approaches may remain valuable in some contexts, particularly those prioritizing precision, reliability, replicability, privacy, and controlled implementation.
format	Article
id	doaj-art-2efdb0fb1ebd40feb566fde48f31837f
institution	Kabale University
issn	2227-7102
language	English
publishDate	2025-05-01
publisher	MDPI AG
record_format	Article
series	Education Sciences
spelling	doaj-art-2efdb0fb1ebd40feb566fde48f31837f | 2025-08-20T03:27:10Z | eng | MDPI AG | Education Sciences | 2227-7102 | 2025-05-01 | 15 | 6 | 676 | 10.3390/educsci15060676 | Large Language Model and Traditional Machine Learning Scoring of Evolutionary Explanations: Benefits and Drawbacks | Yunlong Pan (0) | Ross H. Nehm (1) | (0) Department of Applied Mathematics and Statistics, College of Engineering, Stony Brook University, Stony Brook, NY 11794, USA | (1) Department of Applied Mathematics and Statistics, College of Engineering, Stony Brook University, Stony Brook, NY 11794, USA | https://www.mdpi.com/2227-7102/15/6/676 | large language models (LLMs) | machine learning (ML) | automated scoring | science assessment | accuracy metrics | ethical implications
title	Large Language Model and Traditional Machine Learning Scoring of Evolutionary Explanations: Benefits and Drawbacks
topic	large language models (LLMs); machine learning (ML); automated scoring; science assessment; accuracy metrics; ethical implications
url	https://www.mdpi.com/2227-7102/15/6/676


Similar Items