Automated Ultrasound Diagnosis via CLIP-GPT Synergy: A Multimodal Framework for Image Classification and Report Generation

Bibliographic Details
Main Authors: Li Yan, Xiaodong Zhou, Yaotian Wang, Xuan Chang, Qing Li, Gang Han
Format: Article
Language: English
Published: IEEE, 2025-01-01
Series: IEEE Access
Online Access: https://ieeexplore.ieee.org/document/11029188/
Description
Summary: Ultrasound is a crucial non-invasive imaging modality in clinical diagnosis, yet its interpretation suffers from subjectivity and inefficiency. To address the limitations of traditional single-modal deep learning models in cross-modal alignment and structured text generation, this study proposes an intelligent analysis system based on a joint CLIP-GPT framework, integrating contrastive learning with generative pre-training for end-to-end image classification and diagnostic report generation. Using an ultrasound dataset covering six types of liver lesions, we implement a multi-stage training strategy: visual-semantic cross-modal mapping is first established through CLIP (ViT-B/32), after which GPT-2 and GPT-3.5 are fine-tuned to build a structured report generator. Experimental results show that the proposed system achieves superior performance: classification accuracy reaches 96.4%, recall 95.1%, and F1-score 95.5%, significantly outperforming conventional CNN models (e.g., ResNet-50 at 89.6% accuracy). For report generation, the fine-tuned GPT-2 model achieves a BLEU-4 score of 32.5 and a ROUGE-L score of 41.2, indicating strong alignment with clinical reporting standards. Key innovations include a cross-modal feature decoupling-recombination mechanism that bridges semantic gaps, clinical guideline-driven hierarchical templates that ensure professional standardization, and dynamic attention strategies that enhance lesion discrimination. This study provides an interpretable multimodal solution for medical image analysis, offering significant clinical value for intelligent diagnostic systems.
ISSN: 2169-3536
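
The record does not include the authors' code. As a rough illustration of the classification stage described in the summary, the sketch below pairs a frozen CLIP ViT-B/32 image encoder with a small trainable linear head over six lesion classes. The Hugging Face checkpoint name and the lesion label set are illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn as nn
from transformers import CLIPModel, CLIPProcessor

# Hypothetical label set: the paper reports six liver-lesion types but does not list them here.
LESION_CLASSES = [
    "hepatic cyst", "hemangioma", "focal nodular hyperplasia",
    "hepatocellular carcinoma", "metastasis", "abscess",
]

class CLIPLesionClassifier(nn.Module):
    """Frozen CLIP ViT-B/32 image encoder + trainable linear classification head."""

    def __init__(self, num_classes: int = len(LESION_CLASSES)):
        super().__init__()
        self.clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
        for p in self.clip.parameters():          # keep the pretrained encoder frozen
            p.requires_grad = False
        self.head = nn.Linear(self.clip.config.projection_dim, num_classes)

    def forward(self, pixel_values: torch.Tensor) -> torch.Tensor:
        feats = self.clip.get_image_features(pixel_values=pixel_values)  # (B, 512)
        feats = feats / feats.norm(dim=-1, keepdim=True)                 # unit-normalize
        return self.head(feats)                                          # class logits

processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
model = CLIPLesionClassifier()

# Usage with a single PIL ultrasound frame:
#   inputs = processor(images=pil_image, return_tensors="pt")
#   logits = model(inputs["pixel_values"])
#   prediction = LESION_CLASSES[logits.argmax(dim=-1).item()]
```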
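For the report-generation metrics quoted in the summary (BLEU-4 of 32.5, ROUGE-L of 41.2), the following minimal evaluation sketch scores one generated report against a reference report. It assumes the `nltk` and `rouge-score` packages; the paper does not state which metric implementations were used.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rouge_score import rouge_scorer

def report_scores(reference: str, generated: str) -> dict:
    """BLEU-4 and ROUGE-L F1 (both scaled to 0-100) for one generated report."""
    ref_tokens, gen_tokens = reference.split(), generated.split()
    bleu4 = sentence_bleu(
        [ref_tokens], gen_tokens,
        weights=(0.25, 0.25, 0.25, 0.25),                 # uniform 1- to 4-gram weights
        smoothing_function=SmoothingFunction().method1,   # avoid zero scores on short texts
    )
    rouge_l = rouge_scorer.RougeScorer(["rougeL"]).score(reference, generated)["rougeL"].fmeasure
    return {"BLEU-4": 100 * bleu4, "ROUGE-L": 100 * rouge_l}

# Example:
#   report_scores("Hypoechoic lesion in segment VI ...", "Hypoechoic lesion noted in segment VI ...")
```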