LQ-MixerNeT: A CNN-Transformer Deep Fusion-Based Model for Object Detection in Optical Remote Sensing Images

To address the challenges of low detection accuracy in optical remote sensing images (RSIs) caused by densely distributed targets, extreme scale variations, and insufficient feature representation of small objects, this paper proposes LQ-MixerNeT, a novel CNN-Transformer hybrid framework with deep fusion capabilities. The core innovation lies in the DMIR-Fusion feature integrator, which integrates two key components: the DMI-DWConv Module and the ReLU Linear Attention (RLA) mechanism. This feature integrator dynamically coordinates local high-frequency features from CNNs and global low-frequency features from Transformers, effectively overcoming the inherent limitations of unimodal architectures.


Bibliographic Details
Main Authors: Wenxuan Zheng, Ying Yang
Format: Article
Language: English
Published: IEEE 2025-01-01
Series: IEEE Access
Subjects:
Online Access: https://ieeexplore.ieee.org/document/11002868/
_version_ 1849761342403117056
author Wenxuan Zheng
Ying Yang
author_facet Wenxuan Zheng
Ying Yang
author_sort Wenxuan Zheng
collection DOAJ
description To address the challenges of low detection accuracy in optical remote sensing images (RSIs) caused by densely distributed targets, extreme scale variations, and insufficient feature representation of small objects, this paper proposes LQ-MixerNeT, a novel CNN-Transformer hybrid framework with deep fusion capabilities. The core innovation lies in the DMIR-Fusion feature integrator, which integrates two key components: the DMI-DWConv Module and the ReLU Linear Attention (RLA) mechanism. This feature integrator dynamically coordinates local high-frequency features from CNNs and global low-frequency features from Transformers, effectively overcoming the inherent limitations of unimodal architectures. Furthermore, a frequency-ramping structure is introduced to dynamically regulate the high-frequency/low-frequency information allocation ratio in the DMIR-Fusion integrator across different feature extraction stages through four channel scaling ratios (1/2, 1/4, 1/8, 1/16). The framework also incorporates an enhanced Asymptotic Feature Pyramid Network (AFPN) and Coordinate Attention (CA) mechanisms, synergistically optimizing spatial-semantic alignment and multi-scale feature representation, thereby significantly improving feature extraction performance. Extensive experiments on three benchmark RSI datasets (RSOD, NWPU VHR-10, and DIOR) validate the superiority of LQ-MixerNeT. Results demonstrate that our method achieves mAP@0.5 scores of 73.65%, 85.29%, and 83.25%, respectively. Ablation studies reveal that the DMIR-Fusion integrator contributes a 2.02% accuracy improvement, while the enhanced AFPN boosts performance by 0.54%. These findings highlight the model’s robustness in handling complex RSI scenarios and establish a new paradigm for multimodal fusion in remote sensing object detection.
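The record names a ReLU Linear Attention (RLA) mechanism but does not specify it. A minimal NumPy sketch of the *standard* ReLU linear-attention formulation (softmax replaced by an elementwise ReLU feature map, so attention factorizes as phi(Q)(phi(K)ᵀV) with cost linear in sequence length) illustrates why such a mechanism suits large RSIs; the function name, shapes, and epsilon below are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def relu_linear_attention(Q, K, V, eps=1e-6):
    """Linear attention with a ReLU feature map (illustrative sketch).

    Instead of softmax(Q K^T) V, which costs O(N^2 * d) for N tokens,
    compute phi(Q) @ (phi(K)^T @ V) with phi = ReLU, which costs
    O(N * d^2): the (d, d) summary phi(K)^T V is built once, then
    each query attends globally through it.
    """
    phi_q = np.maximum(Q, 0.0)              # (N, d) nonnegative query features
    phi_k = np.maximum(K, 0.0)              # (N, d) nonnegative key features
    kv = phi_k.T @ V                        # (d, d) global key-value summary
    z = phi_q @ phi_k.sum(axis=0)           # (N,) per-query normalizer
    return (phi_q @ kv) / (z[:, None] + eps)

rng = np.random.default_rng(0)
N, d = 16, 8                                # toy token count and feature width
Q, K, V = (rng.standard_normal((N, d)) for _ in range(3))
out = relu_linear_attention(Q, K, V)
print(out.shape)                            # (16, 8)
```

Because the per-query weights phi(Q)ᵢ·phi(K)ⱼ are nonnegative and normalized, each output row is (up to eps) a convex combination of value rows, mirroring softmax attention's averaging behavior while keeping global receptive field at linear cost.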
format Article
id doaj-art-a6b49422b83e41ce85583d54de290c6b
institution DOAJ
issn 2169-3536
language English
publishDate 2025-01-01
publisher IEEE
record_format Article
series IEEE Access
spelling doaj-art-a6b49422b83e41ce85583d54de290c6b2025-08-20T03:06:04ZengIEEEIEEE Access2169-35362025-01-0113913479136110.1109/ACCESS.2025.356965811002868LQ-MixerNeT: A CNN-Transformer Deep Fusion-Based Model for Object Detection in Optical Remote Sensing ImagesWenxuan Zheng0https://orcid.org/0009-0008-7351-480XYing Yang1https://orcid.org/0009-0004-7561-0329School of Physics and Electronic Information, Jiangsu Second Normal University, Nanjing, ChinaSchool of Physics and Electronic Information, Jiangsu Second Normal University, Nanjing, ChinaTo address the challenges of low detection accuracy in optical remote sensing images (RSIs) caused by densely distributed targets, extreme scale variations, and insufficient feature representation of small objects, this paper proposes LQ-MixerNeT, a novel CNN-Transformer hybrid framework with deep fusion capabilities. The core innovation lies in the DMIR-Fusion feature integrator, which integrates two key components: the DMI-DWConv Module and the ReLU Linear Attention (RLA) mechanism. This feature integrator dynamically coordinates local high-frequency features from CNNs and global low-frequency features from Transformers, effectively overcoming the inherent limitations of unimodal architectures. Furthermore, a frequency-ramping structure is introduced to dynamically regulate the high-frequency/low-frequency information allocation ratio in the DMIR-Fusion integrator across different feature extraction stages through four channel scaling ratios (1/2, 1/4, 1/8, 1/16). The framework also incorporates an enhanced Asymptotic Feature Pyramid Network (AFPN) and Coordinate Attention (CA) mechanisms, synergistically optimizing spatial-semantic alignment and multi-scale feature representation, thereby significantly improving feature extraction performance. Extensive experiments on three benchmark RSI datasets (RSOD, NWPU VHR-10, and DIOR) validate the superiority of LQ-MixerNeT.
Results demonstrate that our method achieves mAP@0.5 scores of 73.65%, 85.29%, and 83.25%, respectively. Ablation studies reveal that the DMIR-Fusion integrator contributes a 2.02% accuracy improvement, while the enhanced AFPN boosts performance by 0.54%. These findings highlight the model’s robustness in handling complex RSI scenarios and establish a new paradigm for multimodal fusion in remote sensing object detection.https://ieeexplore.ieee.org/document/11002868/Transformeroptical remote sensingobject detectionfeature fusion
spellingShingle Wenxuan Zheng
Ying Yang
LQ-MixerNeT: A CNN-Transformer Deep Fusion-Based Model for Object Detection in Optical Remote Sensing Images
IEEE Access
Transformer
optical remote sensing
object detection
feature fusion
title LQ-MixerNeT: A CNN-Transformer Deep Fusion-Based Model for Object Detection in Optical Remote Sensing Images
title_full LQ-MixerNeT: A CNN-Transformer Deep Fusion-Based Model for Object Detection in Optical Remote Sensing Images
title_fullStr LQ-MixerNeT: A CNN-Transformer Deep Fusion-Based Model for Object Detection in Optical Remote Sensing Images
title_full_unstemmed LQ-MixerNeT: A CNN-Transformer Deep Fusion-Based Model for Object Detection in Optical Remote Sensing Images
title_short LQ-MixerNeT: A CNN-Transformer Deep Fusion-Based Model for Object Detection in Optical Remote Sensing Images
title_sort lq mixernet a cnn transformer deep fusion based model for object detection in optical remote sensing images
topic Transformer
optical remote sensing
object detection
feature fusion
url https://ieeexplore.ieee.org/document/11002868/
work_keys_str_mv AT wenxuanzheng lqmixernetacnntransformerdeepfusionbasedmodelforobjectdetectioninopticalremotesensingimages
AT yingyang lqmixernetacnntransformerdeepfusionbasedmodelforobjectdetectioninopticalremotesensingimages