Radiology Report Annotation Using Generative Large Language Models: Comparative Analysis

Recent advancements in large language models (LLMs), particularly GPT-3.5 and GPT-4, have sparked significant interest in their application within the medical field. This research offers a detailed comparative analysis of the abilities of GPT-3.5 and GPT-4 in annotating radiology reports and generating impressions from chest computed tomography (CT) scans. The primary objective is to use these models to assist healthcare professionals with routine documentation tasks. Employing methods such as in-context learning (ICL) and retrieval-augmented generation (RAG), the study focused on generating impression sections from radiological findings. A comprehensive evaluation was conducted using a variety of metrics, including recall-oriented understudy for gisting evaluation (ROUGE) for n-gram overlap, Instructor Similarity for contextual similarity, and BERTScore for semantic similarity. The study shows distinct performance differences between GPT-3.5 and GPT-4 across both zero-shot and few-shot learning scenarios. Prompt wording significantly influenced performance, with certain prompts yielding more accurate impressions. The RAG method achieved a superior BERTScore of 0.92, showcasing its ability to generate semantically rich and contextually accurate impressions. GPT-3.5 and GPT-4, in turn, excel at preserving language tone, with Instructor Similarity scores of approximately 0.92 across scenarios, underscoring the importance of prompt design in effective summarization tasks. The findings emphasize the critical role of prompt design in optimizing model efficacy and point to significant potential for further work in prompt engineering. Moreover, the study advocates for the standardized integration of such advanced LLMs in healthcare practices, highlighting their potential to enhance the efficiency and accuracy of medical documentation.
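
The exact prompts and pipeline used in the study are not reproduced in this record. As a rough illustration only, the sketch below shows what few-shot in-context learning for impression generation can look like with the OpenAI Python client (openai>=1.0); the example findings/impression pair, the prompt wording, and the generate_impression helper are hypothetical, not the authors' method.

    from openai import OpenAI

    client = OpenAI()  # expects OPENAI_API_KEY in the environment

    # Hypothetical few-shot exemplar: a (findings, reference impression) pair.
    EXAMPLES = [
        ("Findings: 5 mm nodule in the right upper lobe, unchanged from prior CT.",
         "Impression: Stable 5 mm right upper lobe nodule; no suspicious interval change."),
    ]

    def generate_impression(findings: str, model: str = "gpt-4") -> str:
        """Summarize chest CT findings into an impression via few-shot ICL."""
        messages = [{
            "role": "system",
            "content": ("You are a radiologist. Write a concise impression "
                        "section for the chest CT findings you are given."),
        }]
        for ex_findings, ex_impression in EXAMPLES:  # in-context examples
            messages.append({"role": "user", "content": ex_findings})
            messages.append({"role": "assistant", "content": ex_impression})
        messages.append({"role": "user", "content": findings})
        response = client.chat.completions.create(
            model=model, messages=messages, temperature=0.2
        )
        return response.choices[0].message.content

A zero-shot variant simply omits the exemplar messages; a RAG variant would instead retrieve similar findings/impression pairs from a report archive and supply them as the in-context examples.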

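The abstract scores generated impressions with ROUGE, BERTScore, and Instructor Similarity. The following is a minimal sketch of the first two metrics, assuming the rouge-score and bert-score Python packages; the study's exact evaluation settings are not given in this record, and the reference and candidate strings are hypothetical.

    from rouge_score import rouge_scorer
    from bert_score import score as bert_score

    reference = "Stable 5 mm right upper lobe nodule; no suspicious interval change."  # hypothetical
    candidate = "The 5 mm nodule in the right upper lobe is unchanged."                # hypothetical

    # ROUGE-1 / ROUGE-L: unigram and longest-common-subsequence overlap with the reference.
    scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
    rouge = scorer.score(reference, candidate)
    print("ROUGE-L F1:", rouge["rougeL"].fmeasure)

    # BERTScore: semantic similarity computed from contextual token embeddings.
    P, R, F1 = bert_score([candidate], [reference], lang="en")
    print("BERTScore F1:", F1.mean().item())

Instructor Similarity is typically obtained as the cosine similarity between instruction-conditioned sentence embeddings of the reference and the candidate; it is omitted above because the embedding model and instruction used in the study are not specified here.
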
Bibliographic Details
Main Authors: Bayan Altalla’, Ashraf Ahmad, Layla Bitar, Mohammed Al-Bssol, Amal Al Omari, Iyad Sultan
Format: Article
Language:English
Published: Wiley 2025-01-01
Series:International Journal of Biomedical Imaging
ISSN:1687-4196
Online Access:http://dx.doi.org/10.1155/ijbi/5019035