LQ-MixerNeT: A CNN-Transformer Deep Fusion-Based Model for Object Detection in Optical Remote Sensing Images

Bibliographic Details
Main Authors: Wenxuan Zheng, Ying Yang
Format: Article
Language: English
Published: IEEE 2025-01-01
Series: IEEE Access
Online Access: https://ieeexplore.ieee.org/document/11002868/
Description
Summary: To address the challenges of low detection accuracy in optical remote sensing images (RSIs) caused by densely distributed targets, extreme scale variations, and insufficient feature representation of small objects, this paper proposes LQ-MixerNeT, a novel CNN-Transformer hybrid framework with deep fusion capabilities. The core innovation is the DMIR-Fusion feature integrator, which combines two key components: the DMI-DWConv Module and the ReLU Linear Attention (RLA) mechanism. This integrator dynamically coordinates local high-frequency features from CNNs and global low-frequency features from Transformers, effectively overcoming the inherent limitations of unimodal architectures. Furthermore, a frequency-ramping structure is introduced to dynamically regulate the ratio of high-frequency to low-frequency information in the DMIR-Fusion integrator across the four feature extraction stages through channel scaling ratios (1/2, 1/4, 1/8, 1/16). The framework also incorporates an enhanced Asymptotic Feature Pyramid Network (AFPN) and Coordinate Attention (CA) mechanisms, which together optimize spatial-semantic alignment and multi-scale feature representation, significantly improving feature extraction performance. Extensive experiments on three benchmark RSI datasets (RSOD, NWPU VHR-10, and DIOR) validate the superiority of LQ-MixerNeT: the method achieves mAP@0.5 scores of 73.65%, 85.29%, and 83.25%, respectively. Ablation studies show that the DMIR-Fusion integrator contributes a 2.02% accuracy improvement and the enhanced AFPN a further 0.54%. These findings highlight the model’s robustness in complex RSI scenarios and establish a new paradigm for multimodal fusion in remote sensing object detection.
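
For orientation, the sketch below illustrates the general ReLU linear attention technique that the summary's RLA mechanism refers to, together with the frequency-ramping channel-split idea. This is a minimal, assumption-laden illustration, not the paper's implementation: the class name, head count, normalization constant, stage dimensions, and which branch receives each channel ratio are all hypothetical.

```python
# Hedged sketch of ReLU linear attention (RLA) and the frequency-ramping
# channel split described in the abstract. Shapes, names, and the branch
# assignment of each ratio are illustrative assumptions, not the paper's code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ReLULinearAttention(nn.Module):
    """Linear attention with ReLU feature maps: softmax(QK^T)V is replaced by
    relu(Q) @ (relu(K)^T @ V), reducing complexity from O(N^2) to O(N)."""
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        assert dim % heads == 0
        self.heads = heads
        self.qkv = nn.Linear(dim, dim * 3, bias=False)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, dim) -- flattened spatial positions of a feature map
        b, n, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # Split heads: (batch, heads, tokens, head_dim)
        q, k, v = (t.view(b, n, self.heads, -1).transpose(1, 2) for t in (q, k, v))
        q, k = F.relu(q), F.relu(k)
        # Associativity trick: compute K^T V first, so cost is linear in n
        kv = k.transpose(-2, -1) @ v                                # (b, h, dh, dh)
        z = 1.0 / (q @ k.sum(dim=-2, keepdim=True).transpose(-2, -1) + 1e-6)
        out = (q @ kv) * z                                          # (b, h, n, dh)
        return self.proj(out.transpose(1, 2).reshape(b, n, d))

if __name__ == "__main__":
    # Illustrative frequency-ramping split using the abstract's ratios
    # (1/2, 1/4, 1/8, 1/16) over four stages; stage widths are assumptions.
    for stage_dim, r in zip((64, 128, 256, 512), (1/2, 1/4, 1/8, 1/16)):
        attn_ch = int(stage_dim * r)   # channels routed to the attention branch
        conv_ch = stage_dim - attn_ch  # channels routed to the DWConv branch
        print(f"dim={stage_dim}: attention={attn_ch}, conv={conv_ch}")
    x = torch.randn(2, 196, 64)        # e.g., a 14x14 feature map per image
    print(ReLULinearAttention(64)(x).shape)  # -> torch.Size([2, 196, 64])
```

Computing (K^T V) before multiplying by Q is what makes the attention linear in token count, which matters for the large feature maps typical of remote sensing imagery.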
ISSN: 2169-3536