Comparative Analysis of Generative Artificial Intelligence Systems in Solving Clinical Pharmacy Problems: Mixed Methods Study

Bibliographic Details
Main Authors: Lulu Li, Pengqiang Du, Xiaojing Huang, Hongwei Zhao, Ming Ni, Meng Yan, Aifeng Wang
Format: Article
Language: English
Published: JMIR Publications 2025-07-01
Series: JMIR Medical Informatics
Online Access: https://medinform.jmir.org/2025/1/e76128
Description
Summary:
Background: Generative artificial intelligence (AI) systems are increasingly deployed in clinical pharmacy, yet systematic evaluation of their efficacy, limitations, and risks across diverse practice scenarios remains limited.
Objective: This study aims to quantitatively evaluate and compare the performance of 8 mainstream generative AI systems across 4 core clinical pharmacy scenarios (medication consultation, medication education, prescription review, and case analysis with pharmaceutical care) using a multidimensional framework.
Methods: Forty-eight clinically validated questions were selected via stratified sampling from real-world sources (eg, hospital consultations, clinical case banks, and national pharmacist training databases). Three researchers simultaneously tested 8 generative AI systems (ERNIE Bot, Doubao, Kimi, Qwen, GPT-4o, Gemini-1.5-Pro, Claude-3.5-Sonnet, and DeepSeek-R1) using standardized prompts on a single day (February 20, 2025). A double-blind scoring design was used, with 6 experienced clinical pharmacists (≥5 years of experience) evaluating the AI responses across 6 dimensions: accuracy, rigor, applicability, logical coherence, conciseness, and universality. Each dimension was scored from 0 to 10 against predefined criteria (eg, −3 for inaccuracy and −2 for incomplete rigor). Statistical analysis used one-way ANOVA with Tukey Honestly Significant Difference (HSD) post hoc testing, and intraclass correlation coefficients (ICC) from a 2-way random model for interrater reliability. Qualitative thematic analysis identified recurrent errors and limitations.
Results: DeepSeek-R1 (DeepSeek) achieved the highest overall performance (mean composite score: medication consultation 9.4, SD 1.0; case analysis 9.3, SD 1.0), significantly outperforming others in complex tasks (P […] Mycoplasma pneumoniae […] P […]
Conclusions: While generative AI shows promise as a pharmacist assistance tool, significant limitations, including high-risk errors (eg, contraindication omissions), inadequate localization, and complex reasoning gaps, preclude autonomous clinical decision-making. Performance stratification highlights DeepSeek-R1’s current advantage, but all systems require optimization in dynamic knowledge updating, complex scenario reasoning, and output interpretability. Future deployment must prioritize human oversight (human-AI co-review), ethical safeguards, and continuous evaluation frameworks.
ISSN: 2291-9694
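Illustrative sketch: the summary names one-way ANOVA and a 2-way random ICC as the statistical methods, but this record does not include the authors' code or data. The Python below is a minimal, standard-library sketch of how those two quantities are commonly computed. It assumes the single-rater, absolute-agreement form ICC(2,1), which the summary does not specify, and all function names and scores are hypothetical. It returns only the F statistic and degrees of freedom; P values and the Tukey HSD post hoc step are not covered.

```python
from statistics import mean


def icc2_1(ratings):
    """ICC(2,1): two-way random effects, absolute agreement, single rater.

    ratings: list of rows, one row per rated response, one column per rater.
    The choice of ICC(2,1) is an assumption; the source says only
    "2-way random model".
    """
    n = len(ratings)      # number of rated responses
    k = len(ratings[0])   # number of raters
    grand = mean(x for row in ratings for x in row)
    row_means = [mean(row) for row in ratings]
    col_means = [mean(row[j] for row in ratings) for j in range(k)]

    # Sums of squares for responses (rows), raters (columns), and residual.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_err = ss_total - ss_rows - ss_cols

    msr = ss_rows / (n - 1)
    msc = ss_cols / (k - 1)
    mse = ss_err / ((n - 1) * (k - 1))
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)


def anova_f(groups):
    """One-way ANOVA: return (F, df_between, df_within)."""
    values = [x for g in groups for x in g]
    grand = mean(values)
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df_between = len(groups) - 1
    df_within = len(values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical 0-10 scores: 4 responses, each rated by 3 pharmacists.
example_ratings = [
    [9.0, 8.0, 9.0],
    [6.0, 5.0, 7.0],
    [8.0, 8.0, 7.0],
    [4.0, 5.0, 4.0],
]
print("ICC(2,1):", round(icc2_1(example_ratings), 3))

# Hypothetical composite scores for 3 AI systems on the same task type.
example_groups = [
    [9.4, 9.1, 9.6, 9.0],
    [8.2, 7.9, 8.5, 8.0],
    [7.1, 7.6, 6.9, 7.4],
]
print("F, df_between, df_within:", anova_f(example_groups))
```

In practice a statistics package would supply the P value for F and the pairwise Tukey HSD comparisons; the functions above only show where the mean squares come from.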