A Latent Multi-Scale Residual Transformer Approach for Cross-Modal Medical Image Synthesis

Bibliographic Details
Main Authors: Xinmiao Zhu, Yang Li
Format: Article
Language: English
Published: IEEE, 2025-01-01
Series: IEEE Access
Subjects:
Online Access: https://ieeexplore.ieee.org/document/10945304/
Description
Summary: Cross-modal generation has emerged as a crucial method for filling in missing modalities in medical imaging. Existing approaches predominantly use convolutional neural networks (CNNs) or vision transformers (ViTs) and their variants as model backbones, and consequently suffer from limited receptive fields or sharply increased computational costs. This paper proposes a latent-feature-space multi-scale residual ViT generative adversarial model (LMRT-NET) that leverages the global sensitivity of ViTs and the local precision of CNNs while reducing computational cost. The generator of LMRT-NET is an encoder-decoder architecture that improves performance and lowers computational expense by applying multi-scale dynamic aggregation residual ViT (DART) blocks in the latent feature space. Each DART block consists of two layers of residual convolutional blocks and transformer blocks at different scales: the transformer blocks help the convolutional blocks capture contextual features, and lower-level blocks support higher-level blocks in learning high-dimensional global information. Additionally, a multi-level information fusion (MIF) module is integrated into the encoder-decoder and the latent feature space; it contains a dual-scale selective fusion (DSF) module that adaptively aggregates multi-scale information to generate target-modality images. Extensive experiments on three datasets demonstrate that LMRT-NET outperforms baseline methods in image generation quality and generalization capability. Our code will be released at https://github.com/ffan14/LMRT-NET.
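Since the authors' code is not yet released, the following is a minimal PyTorch sketch of the two-scale "residual convolution plus transformer" idea the abstract attributes to the DART block, with a sigmoid-gated blend standing in for the dual-scale selective fusion (DSF). All module names, layer choices, and hyperparameters here (channel width, head count, the 2x coarse scale) are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ResidualConvBlock(nn.Module):
    """Residual convolutional block: captures local detail (CNN branch)."""

    def __init__(self, dim: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(dim, dim, kernel_size=3, padding=1),
            nn.InstanceNorm2d(dim),
            nn.ReLU(inplace=True),
            nn.Conv2d(dim, dim, kernel_size=3, padding=1),
            nn.InstanceNorm2d(dim),
        )

    def forward(self, x):
        return x + self.body(x)  # residual connection


class TransformerBlock(nn.Module):
    """Self-attention over flattened spatial tokens: supplies global context."""

    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim * 4), nn.GELU(), nn.Linear(dim * 4, dim)
        )

    def forward(self, x):
        b, c, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)  # (B, H*W, C)
        t = self.norm1(tokens)
        tokens = tokens + self.attn(t, t, t, need_weights=False)[0]
        tokens = tokens + self.mlp(self.norm2(tokens))
        return tokens.transpose(1, 2).reshape(b, c, h, w)


class DARTBlockSketch(nn.Module):
    """Hypothetical two-scale conv+transformer block with gated fusion.

    The fine branch runs at full resolution, the coarse branch at half
    resolution; a learned per-pixel gate selects between them (a stand-in
    for the paper's DSF module).
    """

    def __init__(self, dim: int):
        super().__init__()
        self.fine = nn.Sequential(ResidualConvBlock(dim), TransformerBlock(dim))
        self.coarse = nn.Sequential(ResidualConvBlock(dim), TransformerBlock(dim))
        self.down = nn.AvgPool2d(2)
        self.gate = nn.Sequential(nn.Conv2d(dim * 2, dim, kernel_size=1), nn.Sigmoid())

    def forward(self, x):
        f = self.fine(x)
        c = self.coarse(self.down(x))
        c = F.interpolate(c, size=f.shape[-2:], mode="bilinear", align_corners=False)
        g = self.gate(torch.cat([f, c], dim=1))  # adaptive per-pixel weights
        return g * f + (1 - g) * c               # selective dual-scale fusion


if __name__ == "__main__":
    block = DARTBlockSketch(dim=64)
    out = block(torch.randn(1, 64, 32, 32))  # a latent feature map
    print(out.shape)                          # torch.Size([1, 64, 32, 32])
```

The learned gate producing per-pixel weights over the fine and coarse branches is one plausible reading of "adaptively aggregates multi-scale information"; the released code may realize the DSF module differently.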
ISSN: 2169-3536