MPJ-SPARK Integration-Based Technique to Enhance Big Data Analytics in High Performance Computing Environments

The explosion of data from sources such as smartphone applications, sensors, social media, and High-Performance Computing (HPC) simulations has driven demand for high-performance data analytics. Traditional analytics tools lag behind HPC in computational efficiency, whereas machine learning worklo...

Bibliographic Details
Main Authors:	Sakhr A. Saleh, Maher A. Khemakhem, Fathy E. Eassa
Format:	Article
Language:	English
Published:	IEEE 2025-01-01
Series:	IEEE Access
Subjects:	Big data; HPC; SPARK; MPJ; HPC and big data integration; multi-spark driver architecture
Online Access:	https://ieeexplore.ieee.org/document/11062570/

author	Sakhr A. Saleh, Maher A. Khemakhem, Fathy E. Eassa
collection	DOAJ
description	The explosion of data from sources such as smartphone applications, sensors, social media, and High-Performance Computing (HPC) simulations has driven demand for high-performance data analytics. Traditional analytics tools lag behind HPC in computational efficiency, whereas machine learning workloads require substantial resources. However, integrating HPC and big data presents challenges due to architectural differences. This study introduces an MPJ-Spark integration-based technique that includes a novel multi-Spark-driver architecture to bridge this gap. MPJ-Spark enables a single application to execute concurrently across multiple Spark drivers, thereby improving parallelization and resource management. The methodology involves designing an MPJ-Spark cluster that integrates Message Passing in Java (MPJ) with Spark for efficient communication across HPC nodes. A single MPJ root process manages cluster communications, partitions the input file, and distributes partition metadata to MPJ workers. Each worker operates with an isolated Spark driver, processes tasks independently, and returns results to the root process for aggregation. This eliminates remote shuffling and improves network efficiency. A key-value data structure was developed to facilitate data exchange and to convert Resilient Distributed Datasets (RDDs) into contiguous arrays for MPJ. A shared-storage-aware file manager was designed to improve the reading, writing, and partitioning of the datasets. MPJ-Spark was evaluated on the Aziz Supercomputer using the WordCount workload across datasets ranging from 32 GB to 4.3 TB. The results demonstrated a significant improvement in execution time, ranging from 4x to 6x faster than Spark. This technique enables big data applications to leverage HPC’s computational power and effectively address the gap between these platforms.
format	Article
id	doaj-art-fe1c836f95464855b8d8e6ca9aa4b85d
institution	Kabale University
issn	2169-3536
language	English
publishDate	2025-01-01
publisher	IEEE
record_format	Article
series	IEEE Access
timestamp	2025-08-20T03:50:07Z
volume	13
pages	114241-114255
doi	10.1109/ACCESS.2025.3584744
document_id	11062570
author_orcid	Sakhr A. Saleh (https://orcid.org/0009-0007-7616-5441), Maher A. Khemakhem (https://orcid.org/0000-0002-1287-1634), Fathy E. Eassa (https://orcid.org/0000-0003-3987-9051)
affiliation	Department of Computer Science, Faculty of Computing and Information Technology, King Abdulaziz University, Jeddah, Saudi Arabia (all three authors)
title	MPJ-SPARK Integration-Based Technique to Enhance Big Data Analytics in High Performance Computing Environments
topic	Big data; HPC; SPARK; MPJ; HPC and big data integration; multi-spark driver architecture
url	https://ieeexplore.ieee.org/document/11062570/
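
The abstract in this record describes a root process that partitions the input and sends partition metadata to workers, workers that count independently, and a root-side aggregation of flattened key-value results. The following is a minimal sketch of that pattern only, not the authors' implementation: it uses neither MPJ nor Spark, threads stand in for MPJ worker processes, a local word counter stands in for each worker's Spark job, and every function name (`partition_ranges`, `worker_count`, `root_aggregate`, `run_wordcount`) is hypothetical. The abstract does not specify the layout of the key-value structure, so the parallel-array form here is an assumption.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def partition_ranges(n_items, n_workers):
    """Root step: split n_items into n_workers contiguous (start, count) ranges."""
    base, extra = divmod(n_items, n_workers)
    ranges, start = [], 0
    for rank in range(n_workers):
        count = base + (1 if rank < extra else 0)  # spread the remainder over the first ranks
        ranges.append((start, count))
        start += count
    return ranges


def worker_count(lines, start, count):
    """Worker step: count words in one partition (stand-in for a per-worker Spark job).

    The result is flattened into two parallel lists, an assumed analogue of
    converting key-value pairs into contiguous arrays before message passing.
    """
    local = Counter()
    for line in lines[start:start + count]:
        local.update(line.split())
    keys = sorted(local)
    values = [local[k] for k in keys]
    return keys, values


def root_aggregate(partials):
    """Root step: merge the flattened per-worker results into one word count."""
    total = Counter()
    for keys, values in partials:
        for key, value in zip(keys, values):
            total[key] += value
    return dict(total)


def run_wordcount(lines, n_workers):
    """Drive the whole pattern: partition, count in parallel, aggregate at the root."""
    ranges = partition_ranges(len(lines), n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        futures = [pool.submit(worker_count, lines, s, c) for s, c in ranges]
        partials = [f.result() for f in futures]
    return root_aggregate(partials)
```

Because each worker only ever touches its own range and sends back a compact result, the only data movement in this sketch is the final merge at the root, which mirrors the abstract's point about avoiding remote shuffling.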

Similar Items