Building a modern data platform based on the data lakehouse architecture and cloud-native ecosystem

Abstract: In today’s Big Data world, organisations can gain a competitive edge by adopting data-driven decision-making. However, a modern data platform that is portable, resilient, and efficient is required to manage organisations’ data and support their growth. Furthermore, the change in the data ma...


Bibliographic Details
Main Authors:	Ahmed AbouZaid, Peter J. Barclay, Christos Chrysoulas, Nikolaos Pitropakis
Format:	Article
Language:	English
Published:	Springer 2025-02-01
Series:	Discover Applied Sciences
Subjects:	Data Lakehouse; Kubernetes; DataOps; Cloud-Native; Big Data; Artificial Intelligence
Online Access:	https://doi.org/10.1007/s42452-025-06545-w

collection	DOAJ
description	Abstract: In today’s Big Data world, organisations can gain a competitive edge by adopting data-driven decision-making. However, a modern data platform that is portable, resilient, and efficient is required to manage organisations’ data and support their growth. Furthermore, the change in data management architectures has been accompanied by changes in storage formats, particularly open standard formats like Apache Hudi, Apache Iceberg, and Delta Lake. With many alternatives, organisations are unclear on how to combine these into an effective platform. Our work investigates capabilities provided by Kubernetes and other Cloud-Native software, using DataOps methodologies to build a generic data platform that follows the Data Lakehouse architecture. We define the data platform specification, architecture, and core components to build a proof-of-concept system. Moreover, we provide a clear implementation methodology by developing the core components of the proposed platform: infrastructure (Kubernetes), ingestion and transport (Argo Workflows), storage (MinIO), and query and processing (Dremio). We then conducted performance benchmarks using an industry-standard benchmark suite to compare cold/warm start scenarios and assess Dremio’s caching capabilities, demonstrating a 12% median improvement in query duration with caching.
id	doaj-art-d61ead669be14cd2b70e9df47cd0129f
institution	DOAJ
issn	3004-9261
affiliation	Ahmed AbouZaid, Peter J. Barclay, Nikolaos Pitropakis: School of Computing, Engineering & The Built Environment, Edinburgh Napier University; Christos Chrysoulas: School of Mathematical & Computer Sciences, Heriot-Watt University Edinburgh, Scotland


Similar Items