A Latent Multi-Scale Residual Transformer Approach for Cross-Modal Medical Image Synthesis
| Main Authors: | , |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | IEEE, 2025-01-01 |
| Series: | IEEE Access |
| Online Access: | https://ieeexplore.ieee.org/document/10945304/ |
| Summary: | Cross-modal generation has emerged as a crucial method for filling in missing modalities in medical imaging. Existing approaches predominantly use convolutional neural networks (CNNs) or vision transformers (ViTs) and their variants as model backbones, and consequently suffer from limited receptive fields or sharply increasing computational costs. This paper proposes a latent-feature-space multi-scale residual ViT generative adversarial model (LMRT-NET), which combines the global sensitivity of ViTs with the local precision of CNNs while reducing computational cost. The generator of LMRT-NET is an encoder-decoder architecture that improves performance and lowers computational cost by applying multi-scale dynamic aggregation residual ViT (DART) blocks in the latent feature space. Each DART block consists of two levels of residual convolutional blocks paired with transformer blocks at different scales: the transformer blocks help the convolutional blocks capture contextual features, and the lower-level blocks support the higher-level blocks in learning high-dimensional global information. Additionally, a multi-level information fusion (MIF) module, built around a dual-scale selective fusion (DSF) module that adaptively aggregates multi-scale information, is integrated into the encoder-decoder and the latent feature space to generate target-modality images. Extensive experiments on three datasets demonstrate that LMRT-NET outperforms baseline methods in image generation quality and generalization capability. Our code will be released at https://github.com/ffan14/LMRT-NET. |
| ISSN: | 2169-3536 |
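
The summary above is abstract-level only; as a reading aid, here is a minimal PyTorch sketch of what a two-level DART-style block could look like. All class names, layer choices, and the two-scale wiring are assumptions inferred from the abstract, not the authors' implementation (which is to be released at https://github.com/ffan14/LMRT-NET).

```python
# Hypothetical sketch of a DART-style block. The structure below is an
# assumption inferred from the abstract, not the paper's released code.
import torch
import torch.nn as nn


class ResidualConvBlock(nn.Module):
    """Two 3x3 convolutions with a residual connection (local features)."""

    def __init__(self, channels: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.GELU(),
            nn.Conv2d(channels, channels, 3, padding=1),
        )

    def forward(self, x):
        return x + self.body(x)


class DARTBlockSketch(nn.Module):
    """Illustrative two-level block: a transformer branch at each scale adds
    global context to a residual convolutional branch, and the coarser
    (lower) level feeds the finer (higher) level."""

    def __init__(self, channels: int, heads: int = 4):
        super().__init__()
        self.conv_hi = ResidualConvBlock(channels)
        self.conv_lo = ResidualConvBlock(channels)
        self.attn_hi = nn.TransformerEncoderLayer(
            d_model=channels, nhead=heads, batch_first=True)
        self.attn_lo = nn.TransformerEncoderLayer(
            d_model=channels, nhead=heads, batch_first=True)
        self.down = nn.AvgPool2d(2)
        self.up = nn.Upsample(scale_factor=2, mode="bilinear",
                              align_corners=False)

    @staticmethod
    def _attend(attn, x):
        # Flatten the spatial grid to a token sequence, apply self-attention,
        # then restore the (B, C, H, W) layout.
        b, c, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)  # (B, H*W, C)
        tokens = attn(tokens)
        return tokens.transpose(1, 2).reshape(b, c, h, w)

    def forward(self, x):
        # Lower (coarser) level: global context on a downsampled grid.
        lo = self.down(x)
        lo = self.conv_lo(lo) + self._attend(self.attn_lo, lo)
        # Higher (finer) level, supported by upsampled lower-level features.
        hi = x + self.up(lo)
        return self.conv_hi(hi) + self._attend(self.attn_hi, hi)


if __name__ == "__main__":
    block = DARTBlockSketch(channels=64)
    feats = torch.randn(1, 64, 32, 32)  # latent feature map
    print(block(feats).shape)           # torch.Size([1, 64, 32, 32])
```

Running self-attention on the downsampled grid is one way to keep the token count, and hence the attention cost, low in the latent space; the paper's actual multi-scale aggregation and DSF fusion may differ.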