Automated Ultrasound Diagnosis via CLIP-GPT Synergy: A Multimodal Framework for Image Classification and Report Generation

As a crucial non-invasive imaging modality in clinical diagnosis, ultrasound interpretation faces challenges of subjectivity and inefficiency. To address the limitations of traditional single-modal deep learning models in cross-modal alignment and structured text generation, this study proposes an intelligent analysis system based on a CLIP-GPT joint framework, integrating contrastive learning with generative pre-training for end-to-end image classification and diagnostic report generation. Utilizing an ultrasound dataset containing six types of liver lesions, we implement a multi-stage training strategy: first establishing visual-semantic cross-modal mapping through CLIP (ViT-B/32), followed by fine-tuning GPT-2 and GPT-3.5 to construct a structured report generator.

Bibliographic Details
Main Authors: Li Yan, Xiaodong Zhou, Yaotian Wang, Xuan Chang, Qing Li, Gang Han
Format: Article
Language: English
Published: IEEE, 2025-01-01
Series: IEEE Access
Subjects: Ultrasound image analysis; CLIP-GPT joint framework; cross-modal semantic alignment; structured report generation; contrastive learning; generative pre-training
Online Access: https://ieeexplore.ieee.org/document/11029188/
_version_ 1849417629824974848
author Li Yan
Xiaodong Zhou
Yaotian Wang
Xuan Chang
Qing Li
Gang Han
author_facet Li Yan
Xiaodong Zhou
Yaotian Wang
Xuan Chang
Qing Li
Gang Han
author_sort Li Yan
collection DOAJ
description As a crucial non-invasive imaging modality in clinical diagnosis, ultrasound interpretation faces challenges of subjectivity and inefficiency. To address the limitations of traditional single-modal deep learning models in cross-modal alignment and structured text generation, this study proposes an intelligent analysis system based on a CLIP-GPT joint framework, integrating contrastive learning with generative pre-training for end-to-end image classification and diagnostic report generation. Utilizing an ultrasound dataset containing six types of liver lesions, we implement a multi-stage training strategy: first establishing visual-semantic cross-modal mapping through CLIP (ViT-B/32), followed by fine-tuning GPT-2 and GPT-3.5 to construct a structured report generator. Experimental results demonstrate that the proposed system achieves superior performance: classification accuracy reaches 96.4%, recall 95.1%, and F1-score 95.5%, significantly outperforming conventional CNN models (e.g., ResNet-50 with accuracy 89.6%). For report generation, the fine-tuned GPT-2 model achieves a BLEU-4 score of 32.5 and ROUGE-L score of 41.2, indicating strong alignment with clinical reporting standards. Key innovations include: a cross-modal feature decoupling-recombination mechanism bridging semantic gaps, clinical guideline-driven hierarchical templates ensuring professional standardization, and dynamic attention strategies enhancing lesion discrimination. This study provides an interpretable multimodal solution for medical image analysis, offering significant clinical value for intelligent diagnostic systems.
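The CLIP stage described in the abstract classifies an image by embedding it and every candidate lesion label into a shared space and picking the label whose text embedding is most similar to the image embedding. A minimal, dependency-free sketch of that nearest-label rule (the `cosine` and `classify` helpers and the toy 4-dimensional vectors are illustrative stand-ins introduced here; real CLIP ViT-B/32 embeddings are 512-dimensional and come from the fine-tuned encoders):

```python
import math

def cosine(u, v):
    # cosine similarity between two equal-length vectors
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def classify(image_emb, label_embs):
    # CLIP-style rule: return the label whose text embedding is
    # closest (by cosine similarity) to the image embedding
    return max(label_embs, key=lambda lab: cosine(image_emb, label_embs[lab]))

# Toy embeddings for three of the six lesion classes (illustrative only,
# not real CLIP outputs).
label_embs = {
    "hepatic cyst":             [0.9, 0.1, 0.0, 0.1],
    "hemangioma":               [0.1, 0.8, 0.2, 0.0],
    "hepatocellular carcinoma": [0.0, 0.2, 0.9, 0.1],
}
image_emb = [0.85, 0.15, 0.05, 0.10]
print(classify(image_emb, label_embs))  # → hepatic cyst
```

In the paper's pipeline the similarity scores would come from the contrastively trained CLIP encoders; the decision rule itself is unchanged.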
format Article
id doaj-art-3f7c7fcf17ff44b9bdc74726830269bf
institution Kabale University
issn 2169-3536
language English
publishDate 2025-01-01
publisher IEEE
record_format Article
series IEEE Access
spelling doaj-art-3f7c7fcf17ff44b9bdc74726830269bf (updated 2025-08-20T03:32:42Z)
Language: eng | Publisher: IEEE | Series: IEEE Access | ISSN: 2169-3536
Published: 2025-01-01 | Volume 13, pp. 107950-107960
DOI: 10.1109/ACCESS.2025.3578462 | IEEE Xplore document: 11029188
Title: Automated Ultrasound Diagnosis via CLIP-GPT Synergy: A Multimodal Framework for Image Classification and Report Generation
Authors:
Li Yan (ORCID 0000-0001-9079-8826), Institute of Medical Research, Northwestern Polytechnical University, Xi’an, China
Xiaodong Zhou, Ultrasound Diagnosis and Treatment Center, Xi’an International Medical Center Hospital, Xi’an, China
Yaotian Wang (ORCID 0009-0005-5248-1242), School of Cyberspace Security, Xi’an University of Posts and Telecommunications, Xi’an, Shaanxi, China
Xuan Chang (ORCID 0009-0004-2775-3536), School of Cyberspace Security, Xi’an University of Posts and Telecommunications, Xi’an, Shaanxi, China
Qing Li, Ultrasound Diagnosis and Treatment Center, Xi’an International Medical Center Hospital, Xi’an, China
Gang Han (ORCID 0000-0002-2305-6870), School of Cyberspace Security, Xi’an University of Posts and Telecommunications, Xi’an, Shaanxi, China
Abstract: as given in the description field above.
Online Access: https://ieeexplore.ieee.org/document/11029188/
Keywords: Ultrasound image analysis; CLIP-GPT joint framework; cross-modal semantic alignment; structured report generation; contrastive learning; generative pre-training
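BLEU-4 and ROUGE-L, the report-generation metrics quoted in the abstract (32.5 and 41.2, presumably on a 0-100 scale), have compact reference definitions: BLEU-4 is the brevity-penalized geometric mean of clipped 1- to 4-gram precisions, and ROUGE-L is the F-measure over the longest common subsequence between candidate and reference. A self-contained sketch of both (the toy sentences are invented for illustration; real evaluation runs over tokenized clinical reports):

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu4(candidate, reference):
    # geometric mean of clipped 1..4-gram precisions, times brevity penalty
    precisions = []
    for n in range(1, 5):
        cand = Counter(ngrams(candidate, n))
        ref = Counter(ngrams(reference, n))
        overlap = sum(min(c, ref[g]) for g, c in cand.items())  # clipped counts
        total = max(sum(cand.values()), 1)
        precisions.append(max(overlap, 1e-9) / total)  # smooth zero counts
    bp = 1.0 if len(candidate) > len(reference) \
        else math.exp(1 - len(reference) / len(candidate))
    return bp * math.exp(sum(math.log(p) for p in precisions) / 4)

def rouge_l(candidate, reference):
    # F-measure of the longest common subsequence (dynamic programming)
    m, k = len(candidate), len(reference)
    dp = [[0] * (k + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(k):
            if candidate[i] == reference[j]:
                dp[i + 1][j + 1] = dp[i][j] + 1
            else:
                dp[i + 1][j + 1] = max(dp[i][j + 1], dp[i + 1][j])
    lcs = dp[m][k]
    if lcs == 0:
        return 0.0
    p, r = lcs / m, lcs / k
    return 2 * p * r / (p + r)

ref_tokens = "hypoechoic lesion in the right hepatic lobe with clear margins".split()
cand_tokens = "hypoechoic lesion in the right lobe with clear margins".split()
print(round(rouge_l(cand_tokens, ref_tokens), 3))  # → 0.947
```

Production evaluation would typically use a maintained implementation (e.g. NLTK's BLEU or a ROUGE package) rather than these hand-rolled helpers, but the definitions above match what the reported scores measure.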
spellingShingle Li Yan
Xiaodong Zhou
Yaotian Wang
Xuan Chang
Qing Li
Gang Han
Automated Ultrasound Diagnosis via CLIP-GPT Synergy: A Multimodal Framework for Image Classification and Report Generation
IEEE Access
Ultrasound image analysis
CLIP-GPT joint framework
cross-modal semantic alignment
structured report generation
contrastive learning
generative pre-training
title Automated Ultrasound Diagnosis via CLIP-GPT Synergy: A Multimodal Framework for Image Classification and Report Generation
title_full Automated Ultrasound Diagnosis via CLIP-GPT Synergy: A Multimodal Framework for Image Classification and Report Generation
title_fullStr Automated Ultrasound Diagnosis via CLIP-GPT Synergy: A Multimodal Framework for Image Classification and Report Generation
title_full_unstemmed Automated Ultrasound Diagnosis via CLIP-GPT Synergy: A Multimodal Framework for Image Classification and Report Generation
title_short Automated Ultrasound Diagnosis via CLIP-GPT Synergy: A Multimodal Framework for Image Classification and Report Generation
title_sort automated ultrasound diagnosis via clip gpt synergy a multimodal framework for image classification and report generation
topic Ultrasound image analysis
CLIP-GPT joint framework
cross-modal semantic alignment
structured report generation
contrastive learning
generative pre-training
url https://ieeexplore.ieee.org/document/11029188/
work_keys_str_mv AT liyan automatedultrasounddiagnosisviaclipgptsynergyamultimodalframeworkforimageclassificationandreportgeneration
AT xiaodongzhou automatedultrasounddiagnosisviaclipgptsynergyamultimodalframeworkforimageclassificationandreportgeneration
AT yaotianwang automatedultrasounddiagnosisviaclipgptsynergyamultimodalframeworkforimageclassificationandreportgeneration
AT xuanchang automatedultrasounddiagnosisviaclipgptsynergyamultimodalframeworkforimageclassificationandreportgeneration
AT qingli automatedultrasounddiagnosisviaclipgptsynergyamultimodalframeworkforimageclassificationandreportgeneration
AT ganghan automatedultrasounddiagnosisviaclipgptsynergyamultimodalframeworkforimageclassificationandreportgeneration