The Scalable Detection and Resolution of Data Clumps Using a Modular Pipeline with ChatGPT

This paper explores a modular pipeline architecture that integrates ChatGPT, a Large Language Model (LLM), to automate the detection and refactoring of data clumps—a prevalent type of code smell that complicates software maintainability. Data clumps refer to clusters of code that are often repeated...

Bibliographic Details
Main Authors:	Nils Baumgartner, Padma Iyenghar, Timo Schoemaker, Elke Pulvermüller
Format:	Article
Language:	English
Published:	MDPI AG 2025-02-01
Series:	Software
Subjects:	data clumps; modular pipeline; large language models; ChatGPT; refactoring; scalability
Online Access:	https://www.mdpi.com/2674-113X/4/1/3

author	Nils Baumgartner, Padma Iyenghar, Timo Schoemaker, Elke Pulvermüller
author_facet	Nils Baumgartner, Padma Iyenghar, Timo Schoemaker, Elke Pulvermüller
author_sort	Nils Baumgartner
collection	DOAJ
description	This paper explores a modular pipeline architecture that integrates ChatGPT, a Large Language Model (LLM), to automate the detection and refactoring of data clumps—a prevalent type of code smell that complicates software maintainability. Data clumps refer to clusters of code that are often repeated and should ideally be refactored to improve code quality. The pipeline leverages ChatGPT’s capabilities to understand context and generate structured outputs, making it suitable for addressing complex software refactoring tasks. Through systematic experimentation, our study not only addresses the research questions outlined but also demonstrates that the pipeline can accurately identify data clumps, particularly excelling in cases that require semantic understanding—where localized clumps are embedded within larger codebases. While the solution significantly enhances the refactoring workflow, facilitating the management of distributed clumps across multiple files, it also presents challenges such as occasional compiler errors and high computational costs. Feedback from developers underscores the usefulness of LLMs in software development but also highlights the essential role of human oversight to correct inaccuracies. These findings demonstrate the pipeline’s potential to enhance software maintainability, offering a scalable and efficient solution for addressing code smells in real-world projects, and contributing to the broader goal of enhancing software maintainability in large-scale projects.
format	Article
id	doaj-art-4e6177f36bc046dd9fb427fd9a952da7
institution	Kabale University
issn	2674-113X
language	English
publishDate	2025-02-01
publisher	MDPI AG
record_format	Article
series	Software
spelling	doaj-art-4e6177f36bc046dd9fb427fd9a952da7; 2025-08-20T03:43:54Z; eng; MDPI AG; Software; ISSN 2674-113X; 2025-02-01; vol. 4, no. 1, art. 3; DOI 10.3390/software4010003; The Scalable Detection and Resolution of Data Clumps Using a Modular Pipeline with ChatGPT; Nils Baumgartner (0), Padma Iyenghar (1), Timo Schoemaker (2), Elke Pulvermüller (3), all at Research Group Software Engineering, Institute of Computer Science, Department of Mathematics and Computer Science, University of Osnabrück, 49074 Osnabrück, Germany; https://www.mdpi.com/2674-113X/4/1/3; data clumps; modular pipeline; large language models; ChatGPT; refactoring; scalability
title	The Scalable Detection and Resolution of Data Clumps Using a Modular Pipeline with ChatGPT
topic	data clumps; modular pipeline; large language models; ChatGPT; refactoring; scalability
url	https://www.mdpi.com/2674-113X/4/1/3
