The Scalable Detection and Resolution of Data Clumps Using a Modular Pipeline with ChatGPT
This paper explores a modular pipeline architecture that integrates ChatGPT, a Large Language Model (LLM), to automate the detection and refactoring of data clumps—a prevalent type of code smell that complicates software maintainability. Data clumps refer to clusters of code that are often repeated...
| Main Authors: | Nils Baumgartner, Padma Iyenghar, Timo Schoemaker, Elke Pulvermüller |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | MDPI AG, 2025-02-01 |
| Series: | Software |
| Subjects: | data clumps; modular pipeline; large language models; ChatGPT; refactoring; scalability |
| Online Access: | https://www.mdpi.com/2674-113X/4/1/3 |
| _version_ | 1849340532051935232 |
|---|---|
| author | Nils Baumgartner; Padma Iyenghar; Timo Schoemaker; Elke Pulvermüller |
| author_sort | Nils Baumgartner |
| collection | DOAJ |
| description | This paper explores a modular pipeline architecture that integrates ChatGPT, a Large Language Model (LLM), to automate the detection and refactoring of data clumps—a prevalent type of code smell that complicates software maintainability. Data clumps refer to clusters of code that are often repeated and should ideally be refactored to improve code quality. The pipeline leverages ChatGPT’s capabilities to understand context and generate structured outputs, making it suitable for addressing complex software refactoring tasks. Through systematic experimentation, our study not only addresses the research questions outlined but also demonstrates that the pipeline can accurately identify data clumps, particularly excelling in cases that require semantic understanding—where localized clumps are embedded within larger codebases. While the solution significantly enhances the refactoring workflow, facilitating the management of distributed clumps across multiple files, it also presents challenges such as occasional compiler errors and high computational costs. Feedback from developers underscores the usefulness of LLMs in software development but also highlights the essential role of human oversight to correct inaccuracies. These findings demonstrate the pipeline’s potential to enhance software maintainability, offering a scalable and efficient solution for addressing code smells in real-world projects, and contributing to the broader goal of enhancing software maintainability in large-scale projects. |
| format | Article |
| id | doaj-art-4e6177f36bc046dd9fb427fd9a952da7 |
| institution | Kabale University |
| issn | 2674-113X |
| language | English |
| publishDate | 2025-02-01 |
| publisher | MDPI AG |
| record_format | Article |
| series | Software |
| spelling | doaj-art-4e6177f36bc046dd9fb427fd9a952da7; 2025-08-20T03:43:54Z; eng; MDPI AG; Software; ISSN 2674-113X; 2025-02-01; 4(1), 3; DOI 10.3390/software4010003; The Scalable Detection and Resolution of Data Clumps Using a Modular Pipeline with ChatGPT; Nils Baumgartner, Padma Iyenghar, Timo Schoemaker, Elke Pulvermüller (all: Research Group Software Engineering, Institute of Computer Science, Department of Mathematics and Computer Science, University of Osnabrück, 49074 Osnabrück, Germany); https://www.mdpi.com/2674-113X/4/1/3; data clumps; modular pipeline; large language models; ChatGPT; refactoring; scalability |
| title | The Scalable Detection and Resolution of Data Clumps Using a Modular Pipeline with ChatGPT |
| title_sort | scalable detection and resolution of data clumps using a modular pipeline with chatgpt |
| topic | data clumps; modular pipeline; large language models; ChatGPT; refactoring; scalability |
| url | https://www.mdpi.com/2674-113X/4/1/3 |