MPJ-SPARK Integration-Based Technique to Enhance Big Data Analytics in High Performance Computing Environments

Bibliographic Details
Main Authors: Sakhr A. Saleh, Maher A. Khemakhem, Fathy E. Eassa
Format: Article
Language: English
Published: IEEE 2025-01-01
Series: IEEE Access
Subjects: Big data; HPC; SPARK; MPJ; HPC and big data integration; multi-spark driver architecture
Online Access: https://ieeexplore.ieee.org/document/11062570/
author Sakhr A. Saleh
Maher A. Khemakhem
Fathy E. Eassa
collection DOAJ
description The explosion of data from sources such as smartphone applications, sensors, social media, and High-Performance Computing (HPC) simulations has driven demand for high-performance data analytics. Traditional analytics tools lag behind HPC in computational efficiency, whereas machine learning workloads require substantial resources. However, integrating HPC and big data is challenging because of architectural differences. This study introduces an MPJ-Spark integration-based technique that includes a novel multi-Spark-driver architecture to bridge this gap. MPJ-Spark enables a single application to execute concurrently across multiple Spark drivers, thereby improving parallelization and resource management. The methodology involves designing an MPJ-Spark cluster that integrates Message Passing in Java (MPJ) with Spark for efficient communication across HPC nodes. A single MPJ root process manages cluster communications, partitions the input file, and distributes partition metadata to MPJ workers. Each worker operates with an isolated Spark driver, processes tasks independently, and returns results to the root process for aggregation. This eliminates remote shuffling and improves network efficiency. A key-value data structure was developed to facilitate data exchange and to convert Resilient Distributed Datasets (RDDs) into contiguous arrays for MPJ. A shared-storage-aware file manager was designed to improve the reading, writing, and partitioning of datasets. MPJ-Spark was evaluated on the Aziz Supercomputer using the WordCount workload on datasets ranging from 32 GB to 4.3 TB. The results showed execution times 4x to 6x faster than Spark. This technique enables big data applications to leverage HPC's computational power and effectively addresses the gap between these platforms.
format Article
id doaj-art-fe1c836f95464855b8d8e6ca9aa4b85d
institution Kabale University
issn 2169-3536
language English
publishDate 2025-01-01
publisher IEEE
record_format Article
series IEEE Access
spelling doaj-art-fe1c836f95464855b8d8e6ca9aa4b85d
2025-08-20T03:50:07Z
eng
IEEE, IEEE Access, ISSN 2169-3536, 2025-01-01, vol. 13, pp. 114241-114255
DOI 10.1109/ACCESS.2025.3584744, article 11062570
MPJ-SPARK Integration-Based Technique to Enhance Big Data Analytics in High Performance Computing Environments
Sakhr A. Saleh, https://orcid.org/0009-0007-7616-5441
Maher A. Khemakhem, https://orcid.org/0000-0002-1287-1634
Fathy E. Eassa, https://orcid.org/0000-0003-3987-9051
Department of Computer Science, Faculty of Computing and Information Technology, King Abdulaziz University, Jeddah, Saudi Arabia (all three authors)
https://ieeexplore.ieee.org/document/11062570/
Big data; HPC; SPARK; MPJ; HPC and big data integration; multi-spark driver architecture
title MPJ-SPARK Integration-Based Technique to Enhance Big Data Analytics in High Performance Computing Environments
topic Big data
HPC
SPARK
MPJ
HPC and big data integration
multi-spark driver architecture
url https://ieeexplore.ieee.org/document/11062570/
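
The root/worker flow summarized in the description field can be illustrated with a small, self-contained Java sketch. It is not the authors' implementation and uses neither MPJ nor Spark: the class RootWorkerSketch and its methods are hypothetical names, workers are simulated by plain method calls in one process, and input partitions are line ranges rather than file offsets. It only shows the shape of the idea: the root hands out partition metadata, each worker counts words in its own partition without exchanging data with other workers, the key-value result is flattened into contiguous arrays, and the root merges the partial results.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical single-process illustration; no MPJ or Spark calls are used.
class RootWorkerSketch {

    /** Partition metadata a root could send to a worker: a half-open line range. */
    static final class Partition {
        final int start, end;
        Partition(int start, int end) { this.start = start; this.end = end; }
    }

    /** Key-value result flattened into two parallel, contiguous arrays. */
    static final class FlatCounts {
        final String[] keys;
        final long[] counts;
        FlatCounts(String[] keys, long[] counts) { this.keys = keys; this.counts = counts; }
    }

    /** Root side: split n lines into `workers` near-equal ranges. */
    static List<Partition> partition(int n, int workers) {
        List<Partition> parts = new ArrayList<>();
        int base = n / workers, extra = n % workers, start = 0;
        for (int w = 0; w < workers; w++) {
            int size = base + (w < extra ? 1 : 0);
            parts.add(new Partition(start, start + size));
            start += size;
        }
        return parts;
    }

    /** Worker side: count words in its own range only, then flatten the map. */
    static FlatCounts countLocal(List<String> lines, Partition p) {
        Map<String, Long> counts = new TreeMap<>();
        for (String line : lines.subList(p.start, p.end)) {
            for (String word : line.trim().split("\\s+")) {
                if (!word.isEmpty()) counts.merge(word, 1L, Long::sum);
            }
        }
        String[] keys = counts.keySet().toArray(new String[0]);
        long[] values = new long[keys.length];
        for (int i = 0; i < keys.length; i++) values[i] = counts.get(keys[i]);
        return new FlatCounts(keys, values);
    }

    /** Root side: merge the flattened partial results into one total. */
    static Map<String, Long> aggregate(List<FlatCounts> partials) {
        Map<String, Long> total = new TreeMap<>();
        for (FlatCounts f : partials) {
            for (int i = 0; i < f.keys.length; i++) {
                total.merge(f.keys[i], f.counts[i], Long::sum);
            }
        }
        return total;
    }

    /** Runs the whole flow sequentially: partition, count per worker, aggregate. */
    static Map<String, Long> wordCount(List<String> lines, int workers) {
        List<FlatCounts> partials = new ArrayList<>();
        for (Partition p : partition(lines.size(), workers)) {
            partials.add(countLocal(lines, p));
        }
        return aggregate(partials);
    }

    public static void main(String[] args) {
        List<String> lines = List.of("big data hpc", "spark mpj hpc", "big hpc");
        System.out.println(wordCount(lines, 2)); // {big=2, data=1, hpc=3, mpj=1, spark=1}
    }
}
```

In a real deployment the partition metadata and the flattened arrays would travel between processes as messages, and each worker would run its count inside its own Spark driver. The sketch keeps everything in one process so the data flow is easy to follow.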