LQ-MixerNeT: A CNN-Transformer Deep Fusion-Based Model for Object Detection in Optical Remote Sensing Images
To address the challenges of low detection accuracy in optical remote sensing images (RSIs) caused by densely distributed targets, extreme scale variations, and insufficient feature representation of small objects, this paper proposes LQ-MixerNeT, a novel CNN-Transformer hybrid framework with deep fusion capabilities.
| Main Authors: | Wenxuan Zheng, Ying Yang |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | IEEE, 2025-01-01 |
| Series: | IEEE Access |
| Subjects: | Transformer; optical remote sensing; object detection; feature fusion |
| Online Access: | https://ieeexplore.ieee.org/document/11002868/ |
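The record's abstract credits part of the DMIR-Fusion integrator to a ReLU Linear Attention (RLA) mechanism. The paper's exact formulation is not reproduced in this record, so the following is only a minimal sketch of generic linear attention with a ReLU feature map (function name, shapes, and the `eps` stabiliser are illustrative assumptions): replacing softmax with `phi = ReLU` lets the `(K, V)` product be shared across queries, dropping the quadratic cost in sequence length.

```python
import numpy as np

def relu_linear_attention(q, k, v, eps=1e-6):
    """Linear attention with a ReLU feature map (illustrative sketch).

    Instead of softmax(q @ k.T) @ v, which is O(n^2) in sequence length n,
    compute phi(q) @ (phi(k).T @ v) with phi = ReLU, which is O(n * d^2),
    row-normalised so each query's attention weights sum to 1.
    """
    phi_q = np.maximum(q, 0.0)            # (n, d)
    phi_k = np.maximum(k, 0.0)            # (n, d)
    kv = phi_k.T @ v                      # (d, d_v), shared across all queries
    z = phi_q @ phi_k.sum(axis=0)         # (n,) per-query normaliser
    return (phi_q @ kv) / (z[:, None] + eps)

rng = np.random.default_rng(0)
n, d = 8, 4
q, k, v = rng.normal(size=(3, n, d))      # toy inputs, not real features
out = relu_linear_attention(q, k, v)
print(out.shape)                          # -> (8, 4)
```

By associativity this is numerically equivalent to forming the dense `n x n` weight matrix `relu(q) @ relu(k).T`, normalising its rows, and multiplying by `v`; the linear form simply never materialises that matrix.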
| _version_ | 1849761342403117056 |
|---|---|
| author | Wenxuan Zheng; Ying Yang |
| author_facet | Wenxuan Zheng; Ying Yang |
| author_sort | Wenxuan Zheng |
| collection | DOAJ |
| description | To address the challenges of low detection accuracy in optical remote sensing images (RSIs) caused by densely distributed targets, extreme scale variations, and insufficient feature representation of small objects, this paper proposes LQ-MixerNeT, a novel CNN-Transformer hybrid framework with deep fusion capabilities. The core innovation lies in the DMIR-Fusion feature integrator, which integrates two key components: the DMI-DWConv Module and the ReLU Linear Attention (RLA) mechanism. This feature integrator dynamically coordinates local high-frequency features from CNNs and global low-frequency features from Transformers, effectively overcoming the inherent limitations of unimodal architectures. Furthermore, a frequency-ramping structure is introduced to dynamically regulate the high-frequency/low-frequency information allocation ratio in the DMIR-Fusion integrator across different feature extraction stages through four channel-scaling ratios (1/2, 1/4, 1/8, 1/16). The framework also incorporates an enhanced Asymptotic Feature Pyramid Network (AFPN) and Coordinate Attention (CA) mechanisms, synergistically optimizing spatial-semantic alignment and multi-scale feature representation, thereby significantly improving feature extraction performance. Extensive experiments on three benchmark RSI datasets (RSOD, NWPU VHR-10, and DIOR) validate the superiority of LQ-MixerNeT. Results demonstrate that our method achieves mAP@0.5 scores of 73.65%, 85.29%, and 83.25%, respectively. Ablation studies reveal that the DMIR-Fusion integrator contributes a 2.02% accuracy improvement, while the enhanced AFPN boosts performance by 0.54%. These findings highlight the model’s robustness in handling complex RSI scenarios and establish a new paradigm for multimodal fusion in remote sensing object detection. |
| format | Article |
| id | doaj-art-a6b49422b83e41ce85583d54de290c6b |
| institution | DOAJ |
| issn | 2169-3536 |
| language | English |
| publishDate | 2025-01-01 |
| publisher | IEEE |
| record_format | Article |
| series | IEEE Access |
| spelling | doaj-art-a6b49422b83e41ce85583d54de290c6b; 2025-08-20T03:06:04Z; eng; IEEE; IEEE Access; ISSN 2169-3536; 2025-01-01; vol. 13, pp. 91347–91361; DOI 10.1109/ACCESS.2025.3569658; IEEE article no. 11002868; LQ-MixerNeT: A CNN-Transformer Deep Fusion-Based Model for Object Detection in Optical Remote Sensing Images; Wenxuan Zheng (https://orcid.org/0009-0008-7351-480X) and Ying Yang (https://orcid.org/0009-0004-7561-0329), School of Physics and Electronic Information, Jiangsu Second Normal University, Nanjing, China; https://ieeexplore.ieee.org/document/11002868/; Transformer; optical remote sensing; object detection; feature fusion |
| spellingShingle | Wenxuan Zheng; Ying Yang; LQ-MixerNeT: A CNN-Transformer Deep Fusion-Based Model for Object Detection in Optical Remote Sensing Images; IEEE Access; Transformer; optical remote sensing; object detection; feature fusion |
| title | LQ-MixerNeT: A CNN-Transformer Deep Fusion-Based Model for Object Detection in Optical Remote Sensing Images |
| title_full | LQ-MixerNeT: A CNN-Transformer Deep Fusion-Based Model for Object Detection in Optical Remote Sensing Images |
| title_fullStr | LQ-MixerNeT: A CNN-Transformer Deep Fusion-Based Model for Object Detection in Optical Remote Sensing Images |
| title_full_unstemmed | LQ-MixerNeT: A CNN-Transformer Deep Fusion-Based Model for Object Detection in Optical Remote Sensing Images |
| title_short | LQ-MixerNeT: A CNN-Transformer Deep Fusion-Based Model for Object Detection in Optical Remote Sensing Images |
| title_sort | lq mixernet a cnn transformer deep fusion based model for object detection in optical remote sensing images |
| topic | Transformer optical remote sensing object detection feature fusion |
| url | https://ieeexplore.ieee.org/document/11002868/ |
| work_keys_str_mv | AT wenxuanzheng lqmixernetacnntransformerdeepfusionbasedmodelforobjectdetectioninopticalremotesensingimages AT yingyang lqmixernetacnntransformerdeepfusionbasedmodelforobjectdetectioninopticalremotesensingimages |