Machine Learning in Baseball Analytics: Sabermetrics and Beyond

In this article, we provide a comprehensive review of machine learning-based sports analytics in baseball. This review is primarily guided by the following three research questions: (1) What baseball analytics problems have been studied using machine learning? (2) What data repositories have been us...

Bibliographic Details
Main Authors: Wenbing Zhao, Vyaghri Seetharamayya Akella, Shunkun Yang, Xiong Luo
Format: Article
Language: English
Published: MDPI AG 2025-04-01
Series: Information
Subjects:
Online Access: https://www.mdpi.com/2078-2489/16/5/361
author Wenbing Zhao
Vyaghri Seetharamayya Akella
Shunkun Yang
Xiong Luo
collection DOAJ
description In this article, we provide a comprehensive review of machine learning-based sports analytics in baseball. The review is guided by three research questions: (1) What baseball analytics problems have been studied using machine learning? (2) What data repositories have been used? (3) Which machine learning techniques have been employed in these studies, and how? The answers to these questions lead to several research contributions. First, we provide a taxonomy of baseball analytics problems. According to the proposed taxonomy, machine learning has been employed to (1) predict individual game plays; (2) determine player performance; (3) estimate player valuation; (4) predict future player injuries; and (5) project future game outcomes. Second, we identify a set of data repositories for baseball analytics studies. The most popular data repositories are Baseball Savant and Baseball Reference. Third, we conduct an in-depth analysis of the machine learning models applied in baseball analytics. The most popular models are random forests and support vector machines. Furthermore, only a small fraction of studies have rigorously followed best practices in data preprocessing, model training and testing, and interpretation of prediction outcomes.
format Article
id doaj-art-04032200dd0f4d2a91b92001dfba269b
institution DOAJ
issn 2078-2489
language English
publishDate 2025-04-01
publisher MDPI AG
record_format Article
series Information
spelling doaj-art-04032200dd0f4d2a91b92001dfba269b 2025-08-20T03:14:31Z eng MDPI AG Information 2078-2489 2025-04-01 16 5 361 10.3390/info16050361
Machine Learning in Baseball Analytics: Sabermetrics and Beyond
Wenbing Zhao, Department of Electrical and Computer Engineering, Cleveland State University, Cleveland, OH 44115, USA
Vyaghri Seetharamayya Akella, Department of Electrical and Computer Engineering, Cleveland State University, Cleveland, OH 44115, USA
Shunkun Yang, School of Reliability and Systems Engineering, Beihang University, 37 Xueyuan Road, Beijing 100191, China
Xiong Luo, School of Computer and Communication Engineering, University of Science and Technology Beijing, Beijing 100083, China
https://www.mdpi.com/2078-2489/16/5/361
sports analytics; sabermetrics; major league baseball; machine learning; feature importance; cross-validation
title Machine Learning in Baseball Analytics: Sabermetrics and Beyond
topic sports analytics
sabermetrics
major league baseball
machine learning
feature importance
cross-validation
url https://www.mdpi.com/2078-2489/16/5/361