Applying Machine Learning on Big Data With Apache Spark
The exponential growth of data in the digital age has necessitated the development of frameworks capable of efficiently handling and processing vast datasets. This paper explores the application of machine learning (ML) models within the Apache Spark ecosystem, focusing on the performance and scalab...
| Main Authors: | Elias Dritsas, Maria Trigka |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | IEEE, 2025-01-01 |
| Series: | IEEE Access |
| Subjects: | Big data; machine learning; apache spark; data analysis |
| Online Access: | https://ieeexplore.ieee.org/document/10928329/ |
| _version_ | 1850275827685523456 |
|---|---|
| author | Elias Dritsas, Maria Trigka |
| collection | DOAJ |
| description | The exponential growth of data in the digital age has necessitated the development of frameworks capable of efficiently handling and processing vast datasets. This paper explores the application of machine learning (ML) models within the Apache Spark ecosystem, focusing on the performance and scalability of these models in big data environments. Through comprehensive experiments on three diverse datasets, namely NYC Taxi Trip Duration, Netflix Prize, and Higgs Boson, we address both regression and classification tasks. For the regression tasks using the NYC Taxi Trip Duration and Netflix Prize datasets, we evaluated models including Linear Regression (LinR), Random Forest (RF), Gradient-Boosted Trees (GBT), Support Vector Regressor (SVR), and K-Nearest Neighbors (KNN). For the classification task using the Higgs Boson dataset, we assessed models such as Logistic Regression (LR), RF, GBT, Support Vector Machines (SVM), and KNN. The study employed key performance metrics such as Root Mean Squared Error (RMSE) and Mean Absolute Error (MAE) for regression and Accuracy, Precision, Recall, F1-Score, and Area Under the Curve (AUC) for classification. Our findings indicate that Apache Spark’s in-memory processing and distributed computing capabilities provide effective scalability, allowing these models to handle large-scale data with linear increases in processing time. Finally, this study highlights the importance of model selection and resource optimization in big data contexts and provides valuable insights into the practical integration of ML models within the Spark framework. |
| format | Article |
| id | doaj-art-45c10887b3ab4b5aa4601b779bb3f74f |
| institution | OA Journals |
| issn | 2169-3536 |
| language | English |
| publishDate | 2025-01-01 |
| publisher | IEEE |
| record_format | Article |
| series | IEEE Access |
| spelling | doaj-art-45c10887b3ab4b5aa4601b779bb3f74f; 2025-08-20T01:50:33Z; eng; IEEE; IEEE Access; ISSN 2169-3536; 2025-01-01; vol. 13; pp. 53377-53393; DOI 10.1109/ACCESS.2025.3552042; 10928329; Applying Machine Learning on Big Data With Apache Spark; Elias Dritsas (https://orcid.org/0000-0001-5647-2929) and Maria Trigka (https://orcid.org/0000-0001-7793-0407), both of the Department of Informatics and Computer Engineering, University of West Attica, Egaleo Park Campus, Athens, Greece; https://ieeexplore.ieee.org/document/10928329/; Big data; machine learning; apache spark; data analysis |
| title | Applying Machine Learning on Big Data With Apache Spark |
| topic | Big data; machine learning; apache spark; data analysis |
| url | https://ieeexplore.ieee.org/document/10928329/ |