Generative language models potential for requirement engineering applications: insights into current strengths and limitations

Bibliographic Details
Main Authors:	Summra Saleem, Muhammad Nabeel Asim, Ludger Van Elst, Andreas Dengel
Format:	Article
Language:	English
Published:	Springer 2025-05-01
Series:	Complex & Intelligent Systems
Subjects:	Requirement engineering; Requirements extraction; Requirements classification; Named entity recognition; Question answering system; Generative language models
Online Access:	https://doi.org/10.1007/s40747-024-01707-6

author	Summra Saleem, Muhammad Nabeel Asim, Ludger Van Elst, Andreas Dengel
collection	DOAJ
description	Abstract: Traditional language models have been extensively evaluated in the software engineering domain; however, the potential of ChatGPT and Gemini has not been fully explored. To fill this gap, the paper presents a comprehensive case study investigating the potential of both language models for developing diverse types of requirement engineering applications. It explores in depth how prompts with varying levels of expert knowledge affect the prediction accuracy of both language models. Across four public benchmark datasets for requirement engineering tasks, it compares the performance of both language models with existing task-specific machine/deep learning predictors and traditional language models. Specifically, the paper uses four benchmark datasets: Pure (7445 samples, requirements extraction), PROMISE (622 samples, requirements classification), REQuestA (300 question-answer (QA) pairs), and Aerospace (6347 words, requirements NER tagging). On the requirements extraction benchmark, the state-of-the-art F1-score is 0.86, while ChatGPT and Gemini achieve 0.76 and 0.77, respectively. On the requirements classification dataset, the state-of-the-art F1-score is 0.96, and both language models achieve 0.78. In the named entity recognition (NER) task, the state-of-the-art F1-score is 0.92, whereas ChatGPT reaches 0.36 and Gemini 0.25. On the question answering dataset, the state-of-the-art F1-score is 0.90, and ChatGPT and Gemini achieve 0.91 and 0.88, respectively. Our experiments show that Gemini requires more precise prompt engineering than ChatGPT to provide accurate predictions. Except for question answering, both models underperform current state-of-the-art predictors on the other tasks.
format	Article
id	doaj-art-7bc5dc50f7e842988ce217f9d9bfa24d
institution	OA Journals
issn	2199-4536 2198-6053
language	English
publishDate	2025-05-01
publisher	Springer
record_format	Article
series	Complex & Intelligent Systems
spelling	doaj-art-7bc5dc50f7e842988ce217f9d9bfa24d | 2025-08-20T01:51:35Z | eng | Springer | Complex & Intelligent Systems | 2199-4536 | 2198-6053 | 2025-05-01 | 116122 | 10.1007/s40747-024-01707-6 | Generative language models potential for requirement engineering applications: insights into current strengths and limitations | Summra Saleem (0), Muhammad Nabeel Asim (1), Ludger Van Elst (2), Andreas Dengel (3) | Affiliations: (0) Department of Computer Science, Rhineland-Palatinate Technical University of Kaiserslautern-Landau; (1) German Research Center for Artificial Intelligence GmbH; (2) German Research Center for Artificial Intelligence GmbH; (3) Department of Computer Science, Rhineland-Palatinate Technical University of Kaiserslautern-Landau
title	Generative language models potential for requirement engineering applications: insights into current strengths and limitations
topic	Requirement engineering; Requirements extraction; Requirements classification; Named entity recognition; Question answering system; Generative language models
url	https://doi.org/10.1007/s40747-024-01707-6

Similar Items