Building a modern data platform based on the data lakehouse architecture and cloud-native ecosystem
Abstract In today’s Big Data world, organisations can gain a competitive edge by adopting data-driven decision-making. However, a modern data platform that is portable, resilient, and efficient is required to manage organisations’ data and support their growth. Furthermore, the change in the data ma...
| Main Authors: | Ahmed AbouZaid, Peter J. Barclay, Christos Chrysoulas, Nikolaos Pitropakis |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | Springer, 2025-02-01 |
| Series: | Discover Applied Sciences |
| Subjects: | Data Lakehouse; Kubernetes; DataOps; Cloud-Native; Big Data; Artificial Intelligence |
| Online Access: | https://doi.org/10.1007/s42452-025-06545-w |
| _version_ | 1849723022682882048 |
|---|---|
| author | Ahmed AbouZaid; Peter J. Barclay; Christos Chrysoulas; Nikolaos Pitropakis |
| author_sort | Ahmed AbouZaid |
| collection | DOAJ |
| description | Abstract: In today’s Big Data world, organisations can gain a competitive edge by adopting data-driven decision-making. However, a modern data platform that is portable, resilient, and efficient is required to manage organisations’ data and support their growth. Furthermore, the shift in data management architectures has been accompanied by changes in storage formats, particularly open standard formats like Apache Hudi, Apache Iceberg, and Delta Lake. With many alternatives, organisations are unclear on how to combine these into an effective platform. Our work investigates capabilities provided by Kubernetes and other Cloud-Native software, using DataOps methodologies to build a generic data platform that follows the Data Lakehouse architecture. We define the data platform specification, architecture, and core components to build a proof-of-concept system. Moreover, we provide a clear implementation methodology by developing the core components of the proposed platform, namely infrastructure (Kubernetes), ingestion and transport (Argo Workflows), storage (MinIO), and finally, query and processing (Dremio). We then conducted performance benchmarks using an industry-standard benchmark suite to compare cold/warm start scenarios and assess Dremio’s caching capabilities, demonstrating a 12% median improvement in query duration with caching. |
| format | Article |
| id | doaj-art-d61ead669be14cd2b70e9df47cd0129f |
| institution | DOAJ |
| issn | 3004-9261 |
| language | English |
| publishDate | 2025-02-01 |
| publisher | Springer |
| record_format | Article |
| series | Discover Applied Sciences |
| spelling | doaj-art-d61ead669be14cd2b70e9df47cd0129f; 2025-08-20T03:11:09Z; eng; Springer; Discover Applied Sciences; 3004-9261; 2025-02-01; 73122; 10.1007/s42452-025-06545-w; Building a modern data platform based on the data lakehouse architecture and cloud-native ecosystem; Ahmed AbouZaid (0), Peter J. Barclay (1), Christos Chrysoulas (2), Nikolaos Pitropakis (3); (0) School of Computing, Engineering & The Built Environment, Edinburgh Napier University; (1) School of Computing, Engineering & The Built Environment, Edinburgh Napier University; (2) School of Mathematical & Computer Sciences, Heriot-Watt University Edinburgh, Scotland; (3) School of Computing, Engineering & The Built Environment, Edinburgh Napier University; Abstract: In today’s Big Data world, organisations can gain a competitive edge by adopting data-driven decision-making. However, a modern data platform that is portable, resilient, and efficient is required to manage organisations’ data and support their growth. Furthermore, the shift in data management architectures has been accompanied by changes in storage formats, particularly open standard formats like Apache Hudi, Apache Iceberg, and Delta Lake. With many alternatives, organisations are unclear on how to combine these into an effective platform. Our work investigates capabilities provided by Kubernetes and other Cloud-Native software, using DataOps methodologies to build a generic data platform that follows the Data Lakehouse architecture. We define the data platform specification, architecture, and core components to build a proof-of-concept system. Moreover, we provide a clear implementation methodology by developing the core components of the proposed platform, namely infrastructure (Kubernetes), ingestion and transport (Argo Workflows), storage (MinIO), and finally, query and processing (Dremio). We then conducted performance benchmarks using an industry-standard benchmark suite to compare cold/warm start scenarios and assess Dremio’s caching capabilities, demonstrating a 12% median improvement in query duration with caching.; https://doi.org/10.1007/s42452-025-06545-w; Data Lakehouse; Kubernetes; DataOps; Cloud-Native; Big Data; Artificial Intelligence |
| title | Building a modern data platform based on the data lakehouse architecture and cloud-native ecosystem |
| topic | Data Lakehouse; Kubernetes; DataOps; Cloud-Native; Big Data; Artificial Intelligence |
| url | https://doi.org/10.1007/s42452-025-06545-w |