Applying Machine Learning on Big Data With Apache Spark

The exponential growth of data in the digital age has necessitated the development of frameworks capable of efficiently handling and processing vast datasets. This paper explores the application of machine learning (ML) models within the Apache Spark ecosystem, focusing on the performance and scalab...

Bibliographic Details
Main Authors: Elias Dritsas, Maria Trigka
Format: Article
Language: English
Published: IEEE 2025-01-01
Series: IEEE Access
Subjects: Big data, machine learning, apache spark, data analysis
Online Access: https://ieeexplore.ieee.org/document/10928329/
_version_ 1850275827685523456
author Elias Dritsas
Maria Trigka
author_facet Elias Dritsas
Maria Trigka
author_sort Elias Dritsas
collection DOAJ
description The exponential growth of data in the digital age has necessitated the development of frameworks capable of efficiently handling and processing vast datasets. This paper explores the application of machine learning (ML) models within the Apache Spark ecosystem, focusing on the performance and scalability of these models in big data environments. Through comprehensive experiments on three diverse datasets, namely NYC Taxi Trip Duration, Netflix Prize, and Higgs Boson, we address both regression and classification tasks. For the regression tasks using the NYC Taxi Trip Duration and Netflix Prize datasets, we evaluated models including Linear Regression (LinR), Random Forest (RF), Gradient-Boosted Trees (GBT), Support Vector Regressor (SVR), and K-Nearest Neighbors (KNN). For the classification task using the Higgs Boson dataset, we assessed models such as Logistic Regression (LR), RF, GBT, Support Vector Machines (SVM), and KNN. The study employed key performance metrics such as Root Mean Squared Error (RMSE) and Mean Absolute Error (MAE) for regression and Accuracy, Precision, Recall, F1-Score, and Area Under the Curve (AUC) for classification. Our findings indicate that Apache Spark’s in-memory processing and distributed computing capabilities provide effective scalability, allowing these models to handle large-scale data with linear increases in processing time. Finally, this study highlights the importance of model selection and resource optimization in big data contexts and provides valuable insights into the practical integration of ML models within the Spark framework.
format Article
id doaj-art-45c10887b3ab4b5aa4601b779bb3f74f
institution OA Journals
issn 2169-3536
language English
publishDate 2025-01-01
publisher IEEE
record_format Article
series IEEE Access
spelling doaj-art-45c10887b3ab4b5aa4601b779bb3f74f
2025-08-20T01:50:33Z
eng
IEEE
IEEE Access
2169-3536
2025-01-01
Volume 13, pp. 53377-53393
DOI: 10.1109/ACCESS.2025.3552042
Article number: 10928329
Applying Machine Learning on Big Data With Apache Spark
Elias Dritsas, https://orcid.org/0000-0001-5647-2929, Department of Informatics and Computer Engineering, University of West Attica, Egaleo Park Campus, Athens, Greece
Maria Trigka, https://orcid.org/0000-0001-7793-0407, Department of Informatics and Computer Engineering, University of West Attica, Egaleo Park Campus, Athens, Greece
https://ieeexplore.ieee.org/document/10928329/
Big data
machine learning
apache spark
data analysis
title Applying Machine Learning on Big Data With Apache Spark
topic Big data
machine learning
apache spark
data analysis
url https://ieeexplore.ieee.org/document/10928329/