Applying Machine Learning on Big Data With Apache Spark

The exponential growth of data in the digital age has necessitated the development of frameworks capable of efficiently handling and processing vast datasets. This paper explores the application of machine learning (ML) models within the Apache Spark ecosystem, focusing on the performance and scalab...

Bibliographic Details
Main Authors:	Elias Dritsas, Maria Trigka
Format:	Article
Language:	English
Published:	IEEE 2025-01-01
Series:	IEEE Access
Subjects:	Big data machine learning apache spark data analysis
Online Access:	https://ieeexplore.ieee.org/document/10928329/

author	Elias Dritsas Maria Trigka
collection	DOAJ
description	The exponential growth of data in the digital age has necessitated the development of frameworks capable of efficiently handling and processing vast datasets. This paper explores the application of machine learning (ML) models within the Apache Spark ecosystem, focusing on the performance and scalability of these models in big data environments. Through comprehensive experiments on three diverse datasets, namely NYC Taxi Trip Duration, Netflix Prize, and Higgs Boson, we address both regression and classification tasks. For the regression tasks using the NYC Taxi Trip Duration and Netflix Prize datasets, we evaluated models including Linear Regression (LinR), Random Forest (RF), Gradient-Boosted Trees (GBT), Support Vector Regressor (SVR), and K-Nearest Neighbors (KNN). For the classification task using the Higgs Boson dataset, we assessed models such as Logistic Regression (LR), RF, GBT, Support Vector Machines (SVM), and KNN. The study employed key performance metrics such as Root Mean Squared Error (RMSE) and Mean Absolute Error (MAE) for regression and Accuracy, Precision, Recall, F1-Score, and Area Under the Curve (AUC) for classification. Our findings indicate that Apache Spark’s in-memory processing and distributed computing capabilities provide effective scalability, allowing these models to handle large-scale data with linear increases in processing time. Finally, the study highlights the importance of model selection and resource optimization in big data contexts and provides valuable insights into the practical integration of ML models within the Spark framework.
format	Article
id	doaj-art-45c10887b3ab4b5aa4601b779bb3f74f
institution	OA Journals
issn	2169-3536
language	English
publishDate	2025-01-01
publisher	IEEE
record_format	Article
series	IEEE Access
spelling	doaj-art-45c10887b3ab4b5aa4601b779bb3f74f | 2025-08-20T01:50:33Z | eng | IEEE | IEEE Access | 2169-3536 | 2025-01-01 | 13 | 53377 | 53393 | 10.1109/ACCESS.2025.3552042 | 10928329 | Applying Machine Learning on Big Data With Apache Spark | Elias Dritsas (https://orcid.org/0000-0001-5647-2929) | Maria Trigka (https://orcid.org/0000-0001-7793-0407) | Department of Informatics and Computer Engineering, University of West Attica, Egaleo Park Campus, Athens, Greece | https://ieeexplore.ieee.org/document/10928329/ | Big data | machine learning | apache spark | data analysis
title	Applying Machine Learning on Big Data With Apache Spark
topic	Big data machine learning apache spark data analysis
url	https://ieeexplore.ieee.org/document/10928329/

Similar Items