Machine Learning in Baseball Analytics: Sabermetrics and Beyond
In this article, we provide a comprehensive review of machine learning-based sports analytics in baseball. This review is primarily guided by the following three research questions: (1) What baseball analytics problems have been studied using machine learning? (2) What data repositories have been us...
| Main Authors: | Wenbing Zhao, Vyaghri Seetharamayya Akella, Shunkun Yang, Xiong Luo |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | MDPI AG, 2025-04-01 |
| Series: | Information |
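| Subjects: | sports analytics; sabermetrics; major league baseball; machine learning; feature importance; cross-validation |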
| Online Access: | https://www.mdpi.com/2078-2489/16/5/361 |
| _version_ | 1849711857566220288 |
|---|---|
| author | Wenbing Zhao; Vyaghri Seetharamayya Akella; Shunkun Yang; Xiong Luo |
| author_sort | Wenbing Zhao |
| collection | DOAJ |
| description | In this article, we provide a comprehensive review of machine learning-based sports analytics in baseball. This review is primarily guided by the following three research questions: (1) What baseball analytics problems have been studied using machine learning? (2) What data repositories have been used? (3) What and how machine learning techniques have been employed for these studies? The findings of these research questions lead to several research contributions. First, we provide a taxonomy for baseball analytics problems. According to the proposed taxonomy, machine learning has been employed to (1) predict individual game plays; (2) determine player performance; (3) estimate player valuation; (4) predict future player injuries; and (5) project future game outcomes. Second, we identify a set of data repositories for baseball analytics studies. The most popular data repositories are Baseball Savant and Baseball Reference. Third, we conduct an in-depth analysis of the machine learning models applied in baseball analytics. The most popular machine learning models are random forest and support vector machine. Furthermore, only a small fraction of studies have rigorously followed the best practices in data preprocessing, machine learning model training, testing, and prediction outcome interpretation. |
| format | Article |
| id | doaj-art-04032200dd0f4d2a91b92001dfba269b |
| institution | DOAJ |
| issn | 2078-2489 |
| language | English |
| publishDate | 2025-04-01 |
| publisher | MDPI AG |
| record_format | Article |
| series | Information |
| spelling | doaj-art-04032200dd0f4d2a91b92001dfba269b; 2025-08-20T03:14:31Z; eng; MDPI AG; Information; 2078-2489; 2025-04-01; 16(5), 361; 10.3390/info16050361; Machine Learning in Baseball Analytics: Sabermetrics and Beyond; Wenbing Zhao (0), Vyaghri Seetharamayya Akella (1), Shunkun Yang (2), Xiong Luo (3); (0) Department of Electrical and Computer Engineering, Cleveland State University, Cleveland, OH 44115, USA; (1) Department of Electrical and Computer Engineering, Cleveland State University, Cleveland, OH 44115, USA; (2) School of Reliability and Systems Engineering, Beihang University, 37 Xueyuan Road, Beijing 100191, China; (3) School of Computer and Communication Engineering, University of Science and Technology Beijing, Beijing 100083, China; In this article, we provide a comprehensive review of machine learning-based sports analytics in baseball. This review is primarily guided by the following three research questions: (1) What baseball analytics problems have been studied using machine learning? (2) What data repositories have been used? (3) What and how machine learning techniques have been employed for these studies? The findings of these research questions lead to several research contributions. First, we provide a taxonomy for baseball analytics problems. According to the proposed taxonomy, machine learning has been employed to (1) predict individual game plays; (2) determine player performance; (3) estimate player valuation; (4) predict future player injuries; and (5) project future game outcomes. Second, we identify a set of data repositories for baseball analytics studies. The most popular data repositories are Baseball Savant and Baseball Reference. Third, we conduct an in-depth analysis of the machine learning models applied in baseball analytics. The most popular machine learning models are random forest and support vector machine. Furthermore, only a small fraction of studies have rigorously followed the best practices in data preprocessing, machine learning model training, testing, and prediction outcome interpretation.; https://www.mdpi.com/2078-2489/16/5/361; sports analytics, sabermetrics, major league baseball, machine learning, feature importance, cross-validation |
| title | Machine Learning in Baseball Analytics: Sabermetrics and Beyond |
| topic | sports analytics; sabermetrics; major league baseball; machine learning; feature importance; cross-validation |
| url | https://www.mdpi.com/2078-2489/16/5/361 |