Multimodal Retrieval Method for Images and Diagnostic Reports Using Cross-Attention

Bibliographic Details
Main Authors: Ikumi Sata, Motoki Amagasaki, Masato Kiyama
Format: Article
Language: English
Published: MDPI AG 2025-02-01
Series: AI
Online Access: https://www.mdpi.com/2673-2688/6/2/38
Summary: Background: Conventional medical image retrieval methods treat images and text as independent embeddings, limiting their ability to fully utilize the complementary information from both modalities. This separation often results in suboptimal retrieval performance, as the intricate relationships between images and text remain underexplored. Methods: To address this limitation, we propose a novel retrieval method that integrates medical image and text embeddings using a cross-attention mechanism. Our approach creates a unified representation by directly modeling the interactions between the two modalities, significantly enhancing retrieval accuracy. Results: Built upon the pre-trained BioMedCLIP model, our method outperforms existing techniques across multiple metrics, achieving the highest mean Average Precision (mAP) on the MIMIC-CXR dataset. Conclusions: These results highlight the effectiveness of our method in advancing multimodal medical image retrieval and set the stage for further innovation in the field.
ISSN: 2673-2688
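
The abstract names the mechanism but not the exact architecture, so the following is only a minimal sketch of the general idea it describes: cross-attention in which image token embeddings attend to report token embeddings (such as those a CLIP-style encoder like BioMedCLIP produces) to form a single joint vector for retrieval. The class name CrossAttentionFusion, the embedding dimension of 512, the head count, and the token counts below are illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    """Minimal cross-attention fusion of image and text embeddings (illustrative sketch)."""

    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, image_tokens: torch.Tensor, text_tokens: torch.Tensor) -> torch.Tensor:
        # Queries come from the image; keys/values come from the text, so each
        # image patch embedding is re-weighted by the report tokens it relates to.
        attended, _ = self.attn(image_tokens, text_tokens, text_tokens)
        # Residual connection and layer norm, then mean-pool the token sequence
        # into one vector usable as a joint embedding for retrieval.
        fused = self.norm(image_tokens + attended)
        return fused.mean(dim=1)

# Toy shapes standing in for encoder outputs (e.g., a BioMedCLIP-style backbone):
image_tokens = torch.randn(2, 196, 512)  # (batch, image patches, embedding dim)
text_tokens = torch.randn(2, 77, 512)    # (batch, report tokens, embedding dim)
joint = CrossAttentionFusion()(image_tokens, text_tokens)
print(joint.shape)  # torch.Size([2, 512])
```

In a retrieval setting, vectors produced this way for queries and database entries could be compared by cosine similarity and ranked, which is how a metric such as mAP would then be computed; the paper's actual pooling and scoring choices may differ.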