Evaluation of Generative AI Models in Python Code Generation: A Comparative Study
This study evaluates leading generative AI models for Python code generation. Evaluation criteria include syntax accuracy, response time, completeness, reliability, and cost. The models tested comprise OpenAI’s GPT series (GPT-4 Turbo, GPT-4o, GPT-4o Mini, GPT-3.5 Turbo), Google’s Gemini (1.0 Pro, 1.5 Flash, 1.5 Pro), Meta’s LLaMA (3.0 8B, 3.1 8B), and Anthropic’s Claude models (3.5 Sonnet, 3 Opus, 3 Sonnet, 3 Haiku). Ten coding tasks of varying complexity were tested across three iterations per model to measure performance and consistency. Claude models, especially Claude 3.5 Sonnet, achieved the highest accuracy and reliability, outperforming all other models on both simple and complex tasks. Gemini models showed limitations in handling complex code. Cost-effective options such as Claude 3 Haiku and Gemini 1.5 Flash were budget-friendly and maintained good accuracy on simpler problems. Unlike earlier single-metric studies, this work introduces a multi-dimensional evaluation framework that considers accuracy, reliability, cost, and exception handling. Future work will explore other programming languages and include metrics such as code optimization and security robustness.
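The protocol described in the abstract (ten tasks, three iterations per model, scored on syntax validity and response time) can be sketched roughly as below. This is a simplified illustration, not the authors' harness: `query_model` is a hypothetical stand-in for a real model API call, and only two of the paper's five criteria are computed.

```python
import ast
import time
from statistics import mean

def query_model(model: str, prompt: str) -> str:
    """Hypothetical stand-in for a real model API call.
    Here it simply returns a trivial valid program."""
    return "print('hello from %s')" % model

def evaluate(models, tasks, iterations=3):
    """Run each task `iterations` times per model and record
    syntactic accuracy (via ast.parse) and response latency."""
    results = {}
    for model in models:
        valid, latencies = [], []
        for prompt in tasks:
            for _ in range(iterations):
                start = time.perf_counter()
                code = query_model(model, prompt)
                latencies.append(time.perf_counter() - start)
                try:
                    ast.parse(code)   # does the output parse as Python?
                    valid.append(1)
                except SyntaxError:
                    valid.append(0)
        results[model] = {
            "syntax_accuracy": mean(valid),
            "mean_latency_s": mean(latencies),
        }
    return results

scores = evaluate(["model-a", "model-b"],
                  ["task %d" % i for i in range(10)])
```

A full reproduction would add the paper's remaining criteria (completeness, reliability across iterations, and per-call cost) as further entries in the per-model results dictionary.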
Saved in:
| Main Authors: | Dominik Palla, Antonin Slaby |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | IEEE, 2025-01-01 |
| Series: | IEEE Access |
| Subjects: | Automatization, generative AI, LLM, python, software development |
| Online Access: | https://ieeexplore.ieee.org/document/10963975/ |
| _version_ | 1850148328901181440 |
|---|---|
| author | Dominik Palla Antonin Slaby |
| author_facet | Dominik Palla Antonin Slaby |
| author_sort | Dominik Palla |
| collection | DOAJ |
| description | This study evaluates leading generative AI models for Python code generation. Evaluation criteria include syntax accuracy, response time, completeness, reliability, and cost. The models tested comprise OpenAI’s GPT series (GPT-4 Turbo, GPT-4o, GPT-4o Mini, GPT-3.5 Turbo), Google’s Gemini (1.0 Pro, 1.5 Flash, 1.5 Pro), Meta’s LLaMA (3.0 8B, 3.1 8B), and Anthropic’s Claude models (3.5 Sonnet, 3 Opus, 3 Sonnet, 3 Haiku). Ten coding tasks of varying complexity were tested across three iterations per model to measure performance and consistency. Claude models, especially Claude 3.5 Sonnet, achieved the highest accuracy and reliability. They outperformed all other models in both simple and complex tasks. Gemini models showed limitations in handling complex code. Cost-effective options like Claude 3 Haiku and Gemini 1.5 Flash were budget-friendly and maintained good accuracy on simpler problems. Unlike earlier single-metric studies, this work introduces a multi-dimensional evaluation framework that considers accuracy, reliability, cost, and exception handling. Future work will explore other programming languages and include metrics such as code optimization and security robustness. |
| format | Article |
| id | doaj-art-6a0bb64651314af19b0a2e5a3cb89fe4 |
| institution | OA Journals |
| issn | 2169-3536 |
| language | English |
| publishDate | 2025-01-01 |
| publisher | IEEE |
| record_format | Article |
| series | IEEE Access |
| spelling | doaj-art-6a0bb64651314af19b0a2e5a3cb89fe4 (indexed 2025-08-20T02:27:16Z); eng; IEEE; IEEE Access, ISSN 2169-3536; 2025-01-01; vol. 13, pp. 65334–65347; DOI: 10.1109/ACCESS.2025.3560244; IEEE article 10963975; Evaluation of Generative AI Models in Python Code Generation: A Comparative Study; Dominik Palla (https://orcid.org/0009-0002-4883-0516) and Antonin Slaby (https://orcid.org/0000-0002-0352-4243), Faculty of Informatics and Management, University of Hradec Kralove, Hradec Kralove, Czech Republic; abstract as given in the description field above; https://ieeexplore.ieee.org/document/10963975/; Automatization; generative AI; LLM; python; software development |
| spellingShingle | Dominik Palla Antonin Slaby Evaluation of Generative AI Models in Python Code Generation: A Comparative Study IEEE Access Automatization generative AI LLM python software development |
| title | Evaluation of Generative AI Models in Python Code Generation: A Comparative Study |
| title_full | Evaluation of Generative AI Models in Python Code Generation: A Comparative Study |
| title_fullStr | Evaluation of Generative AI Models in Python Code Generation: A Comparative Study |
| title_full_unstemmed | Evaluation of Generative AI Models in Python Code Generation: A Comparative Study |
| title_short | Evaluation of Generative AI Models in Python Code Generation: A Comparative Study |
| title_sort | evaluation of generative ai models in python code generation a comparative study |
| topic | Automatization generative AI LLM python software development |
| url | https://ieeexplore.ieee.org/document/10963975/ |
| work_keys_str_mv | AT dominikpalla evaluationofgenerativeaimodelsinpythoncodegenerationacomparativestudy AT antoninslaby evaluationofgenerativeaimodelsinpythoncodegenerationacomparativestudy |