An LLM-guided platform for multi-granular collection and management of data provenance
Abstract As machine learning and AI systems become more prevalent, understanding how their decisions are made is key to maintaining their trust. To solve this problem, it is widely accepted that fundamental support can be provided by the knowledge of how data are altered in the pre-processing phase,...
| Main Authors: | Luca Gregori, Pasquale Leonardo Lazzaro, Marialaura Lazzaro, Paolo Missier, Riccardo Torlone |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | SpringerOpen, 2025-07-01 |
| Series: | Journal of Big Data |
| Subjects: | Data provenance; Data preparation pipelines; Explainable AI (XAI); Large language models (LLMs) |
| Online Access: | https://doi.org/10.1186/s40537-025-01209-3 |
| _version_ | 1849332963813097472 |
|---|---|
| author | Luca Gregori; Pasquale Leonardo Lazzaro; Marialaura Lazzaro; Paolo Missier; Riccardo Torlone |
| author_facet | Luca Gregori; Pasquale Leonardo Lazzaro; Marialaura Lazzaro; Paolo Missier; Riccardo Torlone |
| author_sort | Luca Gregori |
| collection | DOAJ |
| description | Abstract As machine learning and AI systems become more prevalent, understanding how their decisions are made is key to maintaining their trust. To solve this problem, it is widely accepted that fundamental support can be provided by the knowledge of how data are altered in the pre-processing phase, using data provenance to track such changes. This paper focuses on the design and development of a system for collecting, managing, and querying data provenance of data preparation pipelines in data science. An investigation of publicly available machine learning pipelines is conducted to identify the most important features required for the tool to achieve impact on a broad selection of pre-processing data manipulation. Building on this study, we present an approach for transparently collecting data provenance based on the use of an LLM to: (i) automatically rewrite user-defined pipelines in a format suitable for this activity and (ii) store an accurate description of all the activities involved in the input pipelines for supporting the explanation of each of them. We then illustrate and test implementation choices aimed at supporting the provenance capture for data preparation pipelines efficiently in a transparent way for data scientists. |
| format | Article |
| id | doaj-art-2558adb29a5d470bb1a0cbfa436842fb |
| institution | Kabale University |
| issn | 2196-1115 |
| language | English |
| publishDate | 2025-07-01 |
| publisher | SpringerOpen |
| record_format | Article |
| series | Journal of Big Data |
| spelling | doaj-art-2558adb29a5d470bb1a0cbfa436842fb; 2025-08-20T03:46:03Z; eng; SpringerOpen; Journal of Big Data; 2196-1115; 2025-07-01; 12; 1; 1; 28; 10.1186/s40537-025-01209-3; An LLM-guided platform for multi-granular collection and management of data provenance; Luca Gregori (DICITA, Università Roma Tre); Pasquale Leonardo Lazzaro (DICITA, Università Roma Tre); Marialaura Lazzaro (DICITA, Università Roma Tre); Paolo Missier (School of Computer Science, University of Birmingham); Riccardo Torlone (DICITA, Università Roma Tre); https://doi.org/10.1186/s40537-025-01209-3; Data provenance; Data preparation pipelines; Explainable AI (XAI); Large language models (LLMs) |
| title | An LLM-guided platform for multi-granular collection and management of data provenance |
| topic | Data provenance; Data preparation pipelines; Explainable AI (XAI); Large language models (LLMs) |
| url | https://doi.org/10.1186/s40537-025-01209-3 |