Can Separation Enhance Fusion? An Efficient Framework for Target Detection in Multimodal Remote Sensing Imagery
Target detection in remote sensing images has garnered significant attention due to its wide range of applications. Many traditional methods primarily rely on unimodal data, which often struggles to address the complexities of remote sensing environments. Furthermore, small-target detection remains a critical challenge in remote sensing image analysis...
| Main Authors: | Yong Wang, Jiexuan Jia, Rui Liu, Qiusheng Cao, Jie Feng, Danping Li, Lei Wang |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | MDPI AG, 2025-04-01 |
| Series: | Remote Sensing |
| Subjects: | multimodal detection; separation before fusion; attention mechanism; remote sensing images; small-target detection |
| Online Access: | https://www.mdpi.com/2072-4292/17/8/1350 |
| _version_ | 1849713863260372992 |
|---|---|
| author | Yong Wang Jiexuan Jia Rui Liu Qiusheng Cao Jie Feng Danping Li Lei Wang |
| author_facet | Yong Wang Jiexuan Jia Rui Liu Qiusheng Cao Jie Feng Danping Li Lei Wang |
| author_sort | Yong Wang |
| collection | DOAJ |
| description | Target detection in remote sensing images has garnered significant attention due to its wide range of applications. Many traditional methods primarily rely on unimodal data, which often struggles to address the complexities of remote sensing environments. Furthermore, small-target detection remains a critical challenge in remote sensing image analysis, as small targets occupy only a few pixels, making feature extraction difficult and prone to errors. To address these challenges, this paper revisits the existing multimodal fusion methodologies and proposes a novel framework of separation before fusion (SBF). Leveraging this framework, we present Sep-Fusion—an efficient target detection approach tailored for multimodal remote sensing aerial imagery. Within the modality separation module (MSM), the method separates the three RGB channels of visible light images into independent modalities aligned with infrared image channels. Each channel undergoes independent feature extraction through the unimodal block (UB) to effectively capture modality-specific features. The extracted features are then fused using the feature attention fusion (FAF) module, which integrates channel attention and spatial attention mechanisms to enhance multimodal feature interaction. To improve the detection of small targets, an image regeneration module is exploited during the training stage. It incorporates the super-resolution strategy with attention mechanisms to further optimize high-resolution feature representations for subsequent positioning and detection. Sep-Fusion is currently developed on the YOLO series to make itself a potential real-time detector. Its lightweight architecture enables the model to achieve high computational efficiency while maintaining the desired detection accuracy. Experimental results on the multimodal VEDAI dataset show that Sep-Fusion achieves 77.9% mAP50, surpassing many state-of-the-art models. Ablation experiments further illustrate the respective contributions of modality separation and attention fusion. The adaptation of our multimodal method to unimodal target detection is also verified on the NWPU VHR-10 and DIOR datasets, which proves Sep-Fusion to be a suitable alternative to current detectors in various remote sensing scenarios. |
| format | Article |
| id | doaj-art-22336246a4414fdd804ff70a264f044c |
| institution | DOAJ |
| issn | 2072-4292 |
| language | English |
| publishDate | 2025-04-01 |
| publisher | MDPI AG |
| record_format | Article |
| series | Remote Sensing |
| spelling | doaj-art-22336246a4414fdd804ff70a264f044c2025-08-20T03:13:51ZengMDPI AGRemote Sensing2072-42922025-04-01178135010.3390/rs17081350Can Separation Enhance Fusion? An Efficient Framework for Target Detection in Multimodal Remote Sensing ImageryYong Wang0Jiexuan Jia1Rui Liu2Qiusheng Cao3Jie Feng4Danping Li5Lei Wang6School of Electronic Engineering, Xidian University, Xi’an 710071, ChinaSchool of Electronic Engineering, Xidian University, Xi’an 710071, ChinaSchool of Electronic Engineering, Xidian University, Xi’an 710071, ChinaSchool of Electronic Engineering, Xidian University, Xi’an 710071, ChinaKey Laboratory of Intelligent Perception and Image Understanding of Ministry of Education of China, School of Artificial Intelligence, Xidian University, Xi’an 710071, ChinaSchool of Telecommunications Engineering, Xidian University, Xi’an 710071, ChinaSchool of Electronic Engineering, Xidian University, Xi’an 710071, ChinaTarget detection in remote sensing images has garnered significant attention due to its wide range of applications. Many traditional methods primarily rely on unimodal data, which often struggle to address the complexities of remote sensing environments. Furthermore, small-target detection remains a critical challenge in remote sensing image analysis, as small targets occupy only a few pixels, making feature extraction difficult and prone to errors. To address these challenges, this paper revisits the existing multimodal fusion methodologies and proposes a novel framework of separation before fusion (SBF). Leveraging this framework, we present Sep-Fusion—an efficient target detection approach tailored for multimodal remote sensing aerial imagery. Within the modality separation module (MSM), the method separates the three RGB channels of visible light images into independent modalities aligned with infrared image channels. Each channel undergoes independent feature extraction through the unimodal block (UB) to effectively capture modality-specific features. The extracted features are then fused using the feature attention fusion (FAF) module, which integrates channel attention and spatial attention mechanisms to enhance multimodal feature interaction. To improve the detection of small targets, an image regeneration module is exploited during the training stage. It incorporates the super-resolution strategy with attention mechanisms to further optimize high-resolution feature representations for subsequent positioning and detection. Sep-Fusion is currently developed on the YOLO series to make itself a potential real-time detector. Its lightweight architecture enables the model to achieve high computational efficiency while maintaining the desired detection accuracy. Experimental results on the multimodal VEDAI dataset show that Sep-Fusion achieves 77.9% mAP50, surpassing many state-of-the-art models. Ablation experiments further illustrate the respective contribution of modality separation and attention fusion. The adaptation of our multimodal method to unimodal target detection is also verified on NWPU VHR-10 and DIOR datasets, which proves Sep-Fusion to be a suitable alternative to current detectors in various remote sensing scenarios.https://www.mdpi.com/2072-4292/17/8/1350multimodal detectionseparation before fusionattention mechanismremote sensing imagessmall-target detection |
| spellingShingle | Yong Wang Jiexuan Jia Rui Liu Qiusheng Cao Jie Feng Danping Li Lei Wang Can Separation Enhance Fusion? An Efficient Framework for Target Detection in Multimodal Remote Sensing Imagery Remote Sensing multimodal detection separation before fusion attention mechanism remote sensing images small-target detection |
| title | Can Separation Enhance Fusion? An Efficient Framework for Target Detection in Multimodal Remote Sensing Imagery |
| title_full | Can Separation Enhance Fusion? An Efficient Framework for Target Detection in Multimodal Remote Sensing Imagery |
| title_fullStr | Can Separation Enhance Fusion? An Efficient Framework for Target Detection in Multimodal Remote Sensing Imagery |
| title_full_unstemmed | Can Separation Enhance Fusion? An Efficient Framework for Target Detection in Multimodal Remote Sensing Imagery |
| title_short | Can Separation Enhance Fusion? An Efficient Framework for Target Detection in Multimodal Remote Sensing Imagery |
| title_sort | can separation enhance fusion an efficient framework for target detection in multimodal remote sensing imagery |
| topic | multimodal detection separation before fusion attention mechanism remote sensing images small-target detection |
| url | https://www.mdpi.com/2072-4292/17/8/1350 |
| work_keys_str_mv | AT yongwang canseparationenhancefusionanefficientframeworkfortargetdetectioninmultimodalremotesensingimagery AT jiexuanjia canseparationenhancefusionanefficientframeworkfortargetdetectioninmultimodalremotesensingimagery AT ruiliu canseparationenhancefusionanefficientframeworkfortargetdetectioninmultimodalremotesensingimagery AT qiushengcao canseparationenhancefusionanefficientframeworkfortargetdetectioninmultimodalremotesensingimagery AT jiefeng canseparationenhancefusionanefficientframeworkfortargetdetectioninmultimodalremotesensingimagery AT danpingli canseparationenhancefusionanefficientframeworkfortargetdetectioninmultimodalremotesensingimagery AT leiwang canseparationenhancefusionanefficientframeworkfortargetdetectioninmultimodalremotesensingimagery |
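The separation-before-fusion idea described in the abstract can be illustrated with a minimal numpy sketch. This is not the authors' implementation: the real MSM/UB/FAF modules are learned networks combining channel and spatial attention, whereas here modality separation is plain channel splitting and "attention" is a simple softmax over global average responses. All function names below are hypothetical.

```python
import numpy as np

def separate_modalities(rgb, ir):
    """Modality separation (MSM, sketch): treat each RGB channel as an
    independent modality alongside the infrared channel.
    rgb: (H, W, 3), ir: (H, W) -> list of four (H, W) arrays."""
    return [rgb[..., c] for c in range(3)] + [ir]

def channel_attention_fuse(features):
    """Attention fusion (FAF, sketch): weight each modality's feature map
    by a softmax over its global average response, then take a weighted sum."""
    stacked = np.stack(features, axis=0)             # (M, H, W)
    pooled = stacked.mean(axis=(1, 2))               # global average pooling, (M,)
    weights = np.exp(pooled) / np.exp(pooled).sum()  # softmax over modalities
    return np.tensordot(weights, stacked, axes=1)    # weighted sum -> (H, W)

rgb = np.random.rand(8, 8, 3)
ir = np.random.rand(8, 8)
modalities = separate_modalities(rgb, ir)
fused = channel_attention_fuse(modalities)
print(len(modalities), fused.shape)  # 4 (8, 8)
```

The sketch only shows why separation can help fusion: each channel keeps its own statistics until the fusion step, so the attention weights can favor whichever modality carries the stronger signal, rather than averaging channels away before fusion.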