Cross-Modal Augmented Transformer for Automated Medical Report Generation
In clinical practice, interpreting medical images and composing diagnostic reports typically involve a significant manual workload. Therefore, an automated report generation framework that mimics a doctor’s diagnosis better meets the requirements of medical scenarios. Prior investigations often overlook this critical aspect, primarily relying on traditional image captioning frameworks initially designed for general-domain images and sentences...
Main Authors: Yuhao Tang, Ye Yuan, Fei Tao, Minghao Tang
Format: Article
Language: English
Published: IEEE, 2025-01-01
Series: IEEE Journal of Translational Engineering in Health and Medicine
Subjects: Medical report generation; medical imaging; automatic diagnosis; clinical automation; image captioning
Online Access: https://ieeexplore.ieee.org/document/10857391/
_version_ | 1823859634998345728 |
---|---|
author | Yuhao Tang; Ye Yuan; Fei Tao; Minghao Tang |
author_facet | Yuhao Tang; Ye Yuan; Fei Tao; Minghao Tang |
author_sort | Yuhao Tang |
collection | DOAJ |
description | In clinical practice, interpreting medical images and composing diagnostic reports typically involve a significant manual workload. Therefore, an automated report generation framework that mimics a doctor’s diagnosis better meets the requirements of medical scenarios. Prior investigations often overlook this critical aspect, primarily relying on traditional image captioning frameworks initially designed for general-domain images and sentences. Despite achieving some advancements, these methodologies encounter two primary challenges. First, the strong noise in blurred medical images often hinders the model from capturing the lesion region. Second, during report writing, doctors typically rely on terminology for diagnosis, a crucial aspect that has been neglected in prior frameworks. In this paper, we present a novel approach called Cross-modal Augmented Transformer (CAT) for medical report generation. Unlike previous methods that rely on coarse-grained features without human intervention, our method introduces a “locate then generate” pattern, thereby improving the interpretability of the generated reports. During the locate stage, CAT captures crucial representations by pre-aligning significant patches and their corresponding medical terminologies. This pre-alignment helps reduce visual noise by discarding low-ranking content, ensuring that only relevant information is considered in the report generation process. During the generation phase, CAT utilizes a multi-modality encoder to reinforce the correlation among generated keywords, retrieved terminologies, and regions. Furthermore, CAT employs a dual-stream decoder that dynamically determines whether the predicted word should be influenced by the retrieved terminology or the preceding sentence.
Experimental results demonstrate the effectiveness of the proposed method on two datasets. Clinical impact: This work aims to design an automated framework for explaining medical images to evaluate the health status of individuals, thereby facilitating their broader application in clinical settings. Clinical and Translational Impact Statement: In our preclinical research, we develop an automated system for generating diagnostic reports. This system mimics manual diagnostic methods by combining fine-grained semantic alignment with dual-stream decoders. |
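The dual-stream decoder described in the abstract, which dynamically decides whether the next word is driven by retrieved terminology or by the preceding sentence, can be pictured as a learned gate mixing two context vectors. The sketch below is an illustrative assumption only: the scalar sigmoid gate, the variable names, and the concatenated-feature gating input are not taken from the paper.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def dual_stream_step(h, term_ctx, sent_ctx, w_gate, b_gate=0.0):
    """One decoding step of a gated dual-stream decoder (illustrative sketch).

    h         : decoder hidden state at this step
    term_ctx  : context vector attending over retrieved terminologies
    sent_ctx  : context vector attending over the preceding sentence
    w_gate    : learned gate weights over [h; term_ctx; sent_ctx]

    Returns the mixed context vector and the gate value that controls how
    strongly the retrieved terminology influences the predicted word.
    """
    features = np.concatenate([h, term_ctx, sent_ctx])
    g = sigmoid(w_gate @ features + b_gate)     # scalar gate in (0, 1)
    mixed = g * term_ctx + (1.0 - g) * sent_ctx # convex blend of the two streams
    return mixed, g
```

With zero gate weights the gate sits at 0.5 and the two streams are averaged; training would push the gate toward the terminology stream for disease-name tokens and toward the sentence stream for connective words.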
format | Article |
id | doaj-art-b1c24d87c1e14bb7b864ac979835b4aa |
institution | Kabale University |
issn | 2168-2372 |
language | English |
publishDate | 2025-01-01 |
publisher | IEEE |
record_format | Article |
series | IEEE Journal of Translational Engineering in Health and Medicine |
spelling | doaj-art-b1c24d87c1e14bb7b864ac979835b4aa | 2025-02-11T00:00:34Z | eng | IEEE | IEEE Journal of Translational Engineering in Health and Medicine | ISSN 2168-2372 | 2025-01-01 | Vol. 13, pp. 33–48 | DOI 10.1109/JTEHM.2025.3536441 | IEEE document 10857391 | Cross-Modal Augmented Transformer for Automated Medical Report Generation | Yuhao Tang (Jiangsu Police Institute, Nanjing, China); Ye Yuan (Jiangsu Provincial Branch of the Industrial and Commercial Bank of China, Nanjing, China; https://orcid.org/0009-0005-3886-3527); Fei Tao (Yangzhou Intermediate People’s Court of Jiangsu Province, Yangzhou, China); Minghao Tang (The First People’s Hospital of Jiashan, Jiaxing, China; https://orcid.org/0009-0007-0440-6007) | https://ieeexplore.ieee.org/document/10857391/ | Medical report generation; medical imaging; automatic diagnosis; clinical automation; image captioning |
spellingShingle | Yuhao Tang; Ye Yuan; Fei Tao; Minghao Tang | Cross-Modal Augmented Transformer for Automated Medical Report Generation | IEEE Journal of Translational Engineering in Health and Medicine | Medical report generation; medical imaging; automatic diagnosis; clinical automation; image captioning |
title | Cross-Modal Augmented Transformer for Automated Medical Report Generation |
title_full | Cross-Modal Augmented Transformer for Automated Medical Report Generation |
title_fullStr | Cross-Modal Augmented Transformer for Automated Medical Report Generation |
title_full_unstemmed | Cross-Modal Augmented Transformer for Automated Medical Report Generation |
title_short | Cross-Modal Augmented Transformer for Automated Medical Report Generation |
title_sort | cross modal augmented transformer for automated medical report generation |
topic | Medical report generation; medical imaging; automatic diagnosis; clinical automation; image captioning |
url | https://ieeexplore.ieee.org/document/10857391/ |
work_keys_str_mv | AT yuhaotang crossmodalaugmentedtransformerforautomatedmedicalreportgeneration AT yeyuan crossmodalaugmentedtransformerforautomatedmedicalreportgeneration AT feitao crossmodalaugmentedtransformerforautomatedmedicalreportgeneration AT minghaotang crossmodalaugmentedtransformerforautomatedmedicalreportgeneration |