An LLM-guided platform for multi-granular collection and management of data provenance

Bibliographic Details
Main Authors: Luca Gregori, Pasquale Leonardo Lazzaro, Marialaura Lazzaro, Paolo Missier, Riccardo Torlone
Format: Article
Language: English
Published: SpringerOpen 2025-07-01
Series: Journal of Big Data
Online Access: https://doi.org/10.1186/s40537-025-01209-3
collection DOAJ
description Abstract As machine learning and AI systems become more prevalent, understanding how their decisions are made is key to maintaining their trust. To solve this problem, it is widely accepted that fundamental support can be provided by the knowledge of how data are altered in the pre-processing phase, using data provenance to track such changes. This paper focuses on the design and development of a system for collecting, managing, and querying data provenance of data preparation pipelines in data science. An investigation of publicly available machine learning pipelines is conducted to identify the most important features required for the tool to achieve impact on a broad selection of pre-processing data manipulation. Building on this study, we present an approach for transparently collecting data provenance based on the use of an LLM to: (i) automatically rewrite user-defined pipelines in a format suitable for this activity and (ii) store an accurate description of all the activities involved in the input pipelines for supporting the explanation of each of them. We then illustrate and test implementation choices aimed at supporting the provenance capture for data preparation pipelines efficiently in a transparent way for data scientists.
id doaj-art-2558adb29a5d470bb1a0cbfa436842fb
institution Kabale University
issn 2196-1115
affiliations Luca Gregori: DICITA, Università Roma Tre
Pasquale Leonardo Lazzaro: DICITA, Università Roma Tre
Marialaura Lazzaro: DICITA, Università Roma Tre
Paolo Missier: School of Computer Science, University of Birmingham
Riccardo Torlone: DICITA, Università Roma Tre
volume 12
topic Data provenance
Data preparation pipelines
Explainable AI (XAI)
Large language models (LLMs)