Can Separation Enhance Fusion? An Efficient Framework for Target Detection in Multimodal Remote Sensing Imagery


Bibliographic Details
Main Authors: Yong Wang, Jiexuan Jia, Rui Liu, Qiusheng Cao, Jie Feng, Danping Li, Lei Wang
Format: Article
Language: English
Published: MDPI AG 2025-04-01
Series: Remote Sensing
Subjects:
Online Access: https://www.mdpi.com/2072-4292/17/8/1350
collection DOAJ
description Target detection in remote sensing images has garnered significant attention due to its wide range of applications. Many traditional methods rely primarily on unimodal data, which often struggle to address the complexities of remote sensing environments. Furthermore, small-target detection remains a critical challenge in remote sensing image analysis: small targets occupy only a few pixels, making feature extraction difficult and error-prone. To address these challenges, this paper revisits existing multimodal fusion methodologies and proposes a novel separation-before-fusion (SBF) framework. Leveraging this framework, we present Sep-Fusion, an efficient target detection approach tailored for multimodal remote sensing aerial imagery. Within the modality separation module (MSM), the method separates the three RGB channels of visible-light images into independent modalities aligned with infrared image channels. Each channel undergoes independent feature extraction through the unimodal block (UB) to effectively capture modality-specific features. The extracted features are then fused by the feature attention fusion (FAF) module, which integrates channel attention and spatial attention mechanisms to enhance multimodal feature interaction. To improve the detection of small targets, an image regeneration module is employed during training; it combines a super-resolution strategy with attention mechanisms to further refine high-resolution feature representations for subsequent localization and detection. Sep-Fusion is currently built on the YOLO series, making it a potential real-time detector. Its lightweight architecture achieves high computational efficiency while maintaining the desired detection accuracy. Experimental results on the multimodal VEDAI dataset show that Sep-Fusion achieves 77.9% mAP50, surpassing many state-of-the-art models.
Ablation experiments further illustrate the respective contributions of modality separation and attention fusion. The adaptation of our multimodal method to unimodal target detection is also verified on the NWPU VHR-10 and DIOR datasets, showing Sep-Fusion to be a suitable alternative to current detectors in various remote sensing scenarios.
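The separation-before-fusion idea in the abstract can be sketched in a few lines. The following is a toy NumPy illustration, not the paper's implementation: the R, G, B, and infrared channels are treated as four unimodal inputs, and simple parameter-free channel and spatial gating stands in for the learned FAF module; all function names here are illustrative.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(feats):
    # feats: (C, H, W); weight each channel by its global average response.
    # A crude, parameter-free stand-in for learned channel attention.
    gap = feats.mean(axis=(1, 2))            # (C,) per-channel pooled response
    w = sigmoid(gap - gap.mean())            # gate channels around the mean
    return feats * w[:, None, None]

def spatial_attention(feats):
    # One attention map from the channel-wise mean and max, applied everywhere.
    m = feats.mean(axis=0)                   # (H, W)
    x = feats.max(axis=0)                    # (H, W)
    attn = sigmoid((m + x) / 2.0)
    return feats * attn[None, :, :]

def sep_fuse(rgb, ir):
    # "Separation before fusion": treat R, G, B, and IR as four unimodal
    # inputs (in the paper each would pass through its own unimodal block),
    # then fuse with channel and spatial attention.
    modalities = [rgb[0], rgb[1], rgb[2], ir]
    feats = np.stack(modalities)             # (4, H, W)
    feats = channel_attention(feats)
    feats = spatial_attention(feats)
    return feats.sum(axis=0)                 # fused (H, W) map

rgb = np.random.rand(3, 8, 8)                # visible-light image, 3 channels
ir = np.random.rand(8, 8)                    # single-channel infrared image
fused = sep_fuse(rgb, ir)
print(fused.shape)                           # (8, 8)
```

In the actual model the per-channel feature extractors and both attention stages are learned, and the fused features feed a YOLO-style detection head; this sketch only shows the data flow of separating modalities before attention-weighted fusion.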
id doaj-art-22336246a4414fdd804ff70a264f044c
issn 2072-4292
doi 10.3390/rs17081350
Volume 17, Issue 8, Article 1350
Author affiliations:
Yong Wang, Jiexuan Jia, Rui Liu, Qiusheng Cao, Lei Wang: School of Electronic Engineering, Xidian University, Xi’an 710071, China
Jie Feng: Key Laboratory of Intelligent Perception and Image Understanding of Ministry of Education of China, School of Artificial Intelligence, Xidian University, Xi’an 710071, China
Danping Li: School of Telecommunications Engineering, Xidian University, Xi’an 710071, China
topic multimodal detection
separation before fusion
attention mechanism
remote sensing images
small-target detection