Automated Ultrasound Diagnosis via CLIP-GPT Synergy: A Multimodal Framework for Image Classification and Report Generation
As a crucial non-invasive imaging modality in clinical diagnosis, ultrasound interpretation faces challenges of subjectivity and inefficiency. To address the limitations of traditional single-modal deep learning models in cross-modal alignment and structured text generation, this study proposes an intelligent analysis system based on a CLIP-GPT joint framework, integrating contrastive learning with generative pre-training for end-to-end image classification and diagnostic report generation.
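The first stage described in the abstract, CLIP-based visual-semantic alignment for lesion classification, amounts to comparing an image embedding against one text-prompt embedding per class and taking a softmax over the scaled cosine similarities. A minimal sketch of that step, using random vectors in place of real CLIP ViT-B/32 encoder outputs (the lesion names below are placeholders; the abstract does not list the six liver-lesion types):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: in the paper's pipeline these embeddings would come from
# CLIP's ViT-B/32 image and text encoders; random vectors are used here purely
# to illustrate the similarity-based classification step.
EMBED_DIM = 512
LESION_CLASSES = [
    "hemangioma", "hepatic cyst", "hepatocellular carcinoma",
    "focal nodular hyperplasia", "liver abscess", "metastasis",
]  # placeholder class names, not taken from the paper

def l2_normalize(x, axis=-1):
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

# One embedding per class prompt (e.g. "an ultrasound image of a {lesion}")
text_emb = l2_normalize(rng.normal(size=(len(LESION_CLASSES), EMBED_DIM)))
image_emb = l2_normalize(rng.normal(size=(1, EMBED_DIM)))

# CLIP-style logits: temperature-scaled cosine similarity image vs. prompts
logit_scale = 100.0
logits = logit_scale * image_emb @ text_emb.T

# Softmax over the six classes (numerically stabilized)
probs = np.exp(logits - logits.max()) / np.exp(logits - logits.max()).sum()
pred = LESION_CLASSES[int(np.argmax(probs))]
print(pred, float(probs.max()))
```

With real encoder outputs, the same similarity-and-softmax step yields the zero-shot or fine-tuned class prediction; the paper's reported 96.4% accuracy comes from this alignment stage, not from the toy vectors above.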
| Main Authors: | Li Yan, Xiaodong Zhou, Yaotian Wang, Xuan Chang, Qing Li, Gang Han |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | IEEE, 2025-01-01 |
| Series: | IEEE Access |
| Subjects: | Ultrasound image analysis; CLIP-GPT joint framework; cross-modal semantic alignment; structured report generation; contrastive learning; generative pre-training |
| Online Access: | https://ieeexplore.ieee.org/document/11029188/ |
| Field | Value |
|---|---|
| author | Li Yan; Xiaodong Zhou; Yaotian Wang; Xuan Chang; Qing Li; Gang Han |
| collection | DOAJ |
| description | As a crucial non-invasive imaging modality in clinical diagnosis, ultrasound interpretation faces challenges of subjectivity and inefficiency. To address the limitations of traditional single-modal deep learning models in cross-modal alignment and structured text generation, this study proposes an intelligent analysis system based on a CLIP-GPT joint framework, integrating contrastive learning with generative pre-training for end-to-end image classification and diagnostic report generation. Utilizing an ultrasound dataset containing six types of liver lesions, we implement a multi-stage training strategy: first establishing visual-semantic cross-modal mapping through CLIP (ViT-B/32), followed by fine-tuning GPT-2 and GPT-3.5 to construct a structured report generator. Experimental results demonstrate that the proposed system achieves superior performance: classification accuracy reaches 96.4%, recall 95.1%, and F1-score 95.5%, significantly outperforming conventional CNN models (e.g., ResNet-50 with accuracy 89.6%). For report generation, the fine-tuned GPT-2 model achieves a BLEU-4 score of 32.5 and ROUGE-L score of 41.2, indicating strong alignment with clinical reporting standards. Key innovations include: a cross-modal feature decoupling-recombination mechanism bridging semantic gaps, clinical guideline-driven hierarchical templates ensuring professional standardization, and dynamic attention strategies enhancing lesion discrimination. This study provides an interpretable multimodal solution for medical image analysis, offering significant clinical value for intelligent diagnostic systems. |
| format | Article |
| id | doaj-art-3f7c7fcf17ff44b9bdc74726830269bf |
| institution | Kabale University |
| issn | 2169-3536 |
| language | English |
| publishDate | 2025-01-01 |
| publisher | IEEE |
| record_format | Article |
| series | IEEE Access |
| doi | 10.1109/ACCESS.2025.3578462 |
| citation | IEEE Access, vol. 13, pp. 107950–107960, 2025-01-01 (IEEE document 11029188) |
| author affiliations | Li Yan (ORCID 0000-0001-9079-8826), Institute of Medical Research, Northwestern Polytechnical University, Xi’an, China; Xiaodong Zhou, Ultrasound Diagnosis and Treatment Center, Xi’an International Medical Center Hospital, Xi’an, China; Yaotian Wang (ORCID 0009-0005-5248-1242), School of Cyberspace Security, Xi’an University of Posts and Telecommunications, Xi’an, Shaanxi, China; Xuan Chang (ORCID 0009-0004-2775-3536), School of Cyberspace Security, Xi’an University of Posts and Telecommunications, Xi’an, Shaanxi, China; Qing Li, Ultrasound Diagnosis and Treatment Center, Xi’an International Medical Center Hospital, Xi’an, China; Gang Han (ORCID 0000-0002-2305-6870), School of Cyberspace Security, Xi’an University of Posts and Telecommunications, Xi’an, Shaanxi, China |
| title | Automated Ultrasound Diagnosis via CLIP-GPT Synergy: A Multimodal Framework for Image Classification and Report Generation |
| topic | Ultrasound image analysis; CLIP-GPT joint framework; cross-modal semantic alignment; structured report generation; contrastive learning; generative pre-training |
| url | https://ieeexplore.ieee.org/document/11029188/ |
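The abstract evaluates report generation with BLEU-4 (32.5) and ROUGE-L (41.2). Both metrics are standard n-gram/LCS measures and can be sketched self-containedly; the example report sentences below are hypothetical, and this simplified single-reference, smoothed BLEU will differ slightly from corpus-level toolkit implementations:

```python
import math
from collections import Counter

def ngram_counts(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu4(candidate, reference):
    """Simplified sentence-level BLEU-4: clipped n-gram precision (n=1..4),
    geometric mean, brevity penalty; single reference, light smoothing."""
    cand, ref = candidate.split(), reference.split()
    precisions = []
    for n in range(1, 5):
        overlap = sum((ngram_counts(cand, n) & ngram_counts(ref, n)).values())
        total = max(sum(ngram_counts(cand, n).values()), 1)
        precisions.append(max(overlap, 1e-9) / total)  # smooth zero counts
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / max(len(cand), 1))
    return bp * math.exp(sum(math.log(p) for p in precisions) / 4)

def rouge_l_f1(candidate, reference):
    """ROUGE-L F1 from the longest common subsequence of token sequences."""
    a, b = candidate.split(), reference.split()
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if x == y else max(dp[i][j + 1], dp[i + 1][j])
    lcs = dp[-1][-1]
    if lcs == 0:
        return 0.0
    p, r = lcs / len(a), lcs / len(b)
    return 2 * p * r / (p + r)

# Hypothetical reference report line vs. generated line
ref = "hypoechoic lesion in the right hepatic lobe with irregular margins"
hyp = "hypoechoic lesion in right hepatic lobe with clear margins"
print(round(bleu4(hyp, ref), 3), round(rouge_l_f1(hyp, ref), 3))
```

Scores like the paper's 32.5 BLEU-4 are conventionally reported ×100 over a whole test corpus, so a per-sentence value from this sketch is only indicative.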