The Scalable Detection and Resolution of Data Clumps Using a Modular Pipeline with ChatGPT

This paper explores a modular pipeline architecture that integrates ChatGPT, a Large Language Model (LLM), to automate the detection and refactoring of data clumps—a prevalent type of code smell that complicates software maintainability. Data clumps refer to clusters of code that are often repeated...
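
As a rough illustration of the smell named in the summary above, the following minimal Python sketch shows a data clump and one common way to resolve it by extracting a small value class. It is not taken from the article: the function names, the `Point3D` class, and the choice of Python are all assumptions made here for illustration only.

```python
from dataclasses import dataclass


# Before: the values (x, y, z) always travel together through several
# signatures. This repeated grouping is what is usually called a data clump.
def distance_before(x1, y1, z1, x2, y2, z2):
    return ((x2 - x1) ** 2 + (y2 - y1) ** 2 + (z2 - z1) ** 2) ** 0.5


def translate_before(x, y, z, dx, dy, dz):
    return x + dx, y + dy, z + dz


# After: the clump is extracted into one value class (hypothetical name),
# so each signature carries a single object instead of three loose values.
@dataclass(frozen=True)
class Point3D:
    x: float
    y: float
    z: float

    def distance_to(self, other: "Point3D") -> float:
        return ((other.x - self.x) ** 2
                + (other.y - self.y) ** 2
                + (other.z - self.z) ** 2) ** 0.5

    def translate(self, dx: float, dy: float, dz: float) -> "Point3D":
        return Point3D(self.x + dx, self.y + dy, self.z + dz)
```

Both versions compute the same results; the refactored form only changes where the grouped values live, which is why this kind of change can in principle be automated and then checked by recompiling or rerunning tests.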

Bibliographic Details
Main Authors: Nils Baumgartner, Padma Iyenghar, Timo Schoemaker, Elke Pulvermüller
Format: Article
Language: English
Published: MDPI AG 2025-02-01
Series: Software
Subjects: data clumps; modular pipeline; large language models; ChatGPT; refactoring; scalability
Online Access: https://www.mdpi.com/2674-113X/4/1/3
author Nils Baumgartner
Padma Iyenghar
Timo Schoemaker
Elke Pulvermüller
collection DOAJ
description This paper explores a modular pipeline architecture that integrates ChatGPT, a Large Language Model (LLM), to automate the detection and refactoring of data clumps, a prevalent code smell that complicates software maintenance. Data clumps are groups of variables, such as fields or method parameters, that repeatedly appear together and should ideally be refactored to improve code quality. The pipeline leverages ChatGPT’s ability to understand context and generate structured outputs, which makes it suitable for complex refactoring tasks. Through systematic experimentation, the study addresses its research questions and shows that the pipeline can accurately identify data clumps, excelling in cases that require semantic understanding, where localized clumps are embedded within larger codebases. The solution streamlines the refactoring workflow and helps manage clumps distributed across multiple files, but it also presents challenges such as occasional compiler errors and high computational costs. Feedback from developers underscores the usefulness of LLMs in software development while highlighting the essential role of human oversight in correcting inaccuracies. These findings demonstrate the pipeline’s potential as a scalable and efficient approach to addressing code smells and improving maintainability in large-scale, real-world projects.
format Article
id doaj-art-4e6177f36bc046dd9fb427fd9a952da7
institution Kabale University
issn 2674-113X
language English
publishDate 2025-02-01
publisher MDPI AG
record_format Article
series Software
spelling doaj-art-4e6177f36bc046dd9fb427fd9a952da7
2025-08-20T03:43:54Z
eng
MDPI AG
Software
2674-113X
2025-02-01
4
1
3
10.3390/software4010003
The Scalable Detection and Resolution of Data Clumps Using a Modular Pipeline with ChatGPT
Nils Baumgartner
Padma Iyenghar
Timo Schoemaker
Elke Pulvermüller
Research Group Software Engineering, Institute of Computer Science, Department of Mathematics and Computer Science, University of Osnabrück, 49074 Osnabrück, Germany (listed for each of the four authors)
https://www.mdpi.com/2674-113X/4/1/3
data clumps
modular pipeline
large language models
ChatGPT
refactoring
scalability
title The Scalable Detection and Resolution of Data Clumps Using a Modular Pipeline with ChatGPT
topic data clumps
modular pipeline
large language models
ChatGPT
refactoring
scalability
url https://www.mdpi.com/2674-113X/4/1/3