Building a modern data platform based on the data lakehouse architecture and cloud-native ecosystem

Bibliographic Details
Main Authors: Ahmed AbouZaid, Peter J. Barclay, Christos Chrysoulas, Nikolaos Pitropakis
Format: Article
Language: English
Published: Springer 2025-02-01
Series: Discover Applied Sciences
Subjects: Data Lakehouse; Kubernetes; DataOps; Cloud-Native; Big Data; Artificial Intelligence
Online Access: https://doi.org/10.1007/s42452-025-06545-w
author Ahmed AbouZaid
Peter J. Barclay
Christos Chrysoulas
Nikolaos Pitropakis
collection DOAJ
description Abstract: In today’s Big Data world, organisations can gain a competitive edge by adopting data-driven decision-making. However, managing an organisation’s data and supporting its growth requires a modern data platform that is portable, resilient, and efficient. Furthermore, the shift in data management architectures has been accompanied by changes in storage formats, particularly open standard formats such as Apache Hudi, Apache Iceberg, and Delta Lake. With so many alternatives, organisations are unclear on how to combine these into an effective platform. Our work investigates the capabilities provided by Kubernetes and other Cloud-Native software, using DataOps methodologies to build a generic data platform that follows the Data Lakehouse architecture. We define the data platform specification, architecture, and core components needed to build a proof-of-concept system. Moreover, we provide a clear implementation methodology by developing the core components of the proposed platform: infrastructure (Kubernetes), ingestion and transport (Argo Workflows), storage (MinIO), and query and processing (Dremio). We then ran performance benchmarks with an industry-standard benchmark suite to compare cold-start and warm-start scenarios and to assess Dremio’s caching capabilities, demonstrating a 12% median improvement in query duration with caching.
format Article
id doaj-art-d61ead669be14cd2b70e9df47cd0129f
institution DOAJ
issn 3004-9261
language English
publishDate 2025-02-01
publisher Springer
record_format Article
series Discover Applied Sciences
spelling doaj-art-d61ead669be14cd2b70e9df47cd0129f; 2025-08-20T03:11:09Z; eng; Springer; Discover Applied Sciences; 3004-9261; 2025-02-01; 73122; 10.1007/s42452-025-06545-w
Building a modern data platform based on the data lakehouse architecture and cloud-native ecosystem
Ahmed AbouZaid (School of Computing, Engineering & The Built Environment, Edinburgh Napier University)
Peter J. Barclay (School of Computing, Engineering & The Built Environment, Edinburgh Napier University)
Christos Chrysoulas (School of Mathematical & Computer Sciences, Heriot-Watt University Edinburgh, Scotland)
Nikolaos Pitropakis (School of Computing, Engineering & The Built Environment, Edinburgh Napier University)
https://doi.org/10.1007/s42452-025-06545-w
Data Lakehouse; Kubernetes; DataOps; Cloud-Native; Big Data; Artificial Intelligence
title Building a modern data platform based on the data lakehouse architecture and cloud-native ecosystem
topic Data Lakehouse
Kubernetes
DataOps
Cloud-Native
Big Data
Artificial Intelligence
url https://doi.org/10.1007/s42452-025-06545-w