LQ-MixerNeT: A CNN-Transformer Deep Fusion-Based Model for Object Detection in Optical Remote Sensing Images

To address the challenges of low detection accuracy in optical remote sensing images (RSIs) caused by densely distributed targets, extreme scale variations, and insufficient feature representation of small objects, this paper proposes LQ-MixerNeT, a novel CNN-Transformer hybrid framework with deep fusion capabilities. The core innovation lies in the DMIR-Fusion feature integrator, which integrates two key components: the DMI-DWConv Module and the ReLU Linear Attention (RLA) mechanism. This feature integrator dynamically coordinates local high-frequency features from CNNs and global low-frequency features from Transformers, effectively overcoming the inherent limitations of unimodal architectures.


Bibliographic Details
Main Authors: Wenxuan Zheng, Ying Yang
Format: Article
Language: English
Published: IEEE 2025-01-01
Series: IEEE Access
Subjects:
Online Access: https://ieeexplore.ieee.org/document/11002868/
_version_ 1849761342403117056
author Wenxuan Zheng
Ying Yang
author_facet Wenxuan Zheng
Ying Yang
author_sort Wenxuan Zheng
collection DOAJ
description To address the challenges of low detection accuracy in optical remote sensing images (RSIs) caused by densely distributed targets, extreme scale variations, and insufficient feature representation of small objects, this paper proposes LQ-MixerNeT, a novel CNN-Transformer hybrid framework with deep fusion capabilities. The core innovation lies in the DMIR-Fusion feature integrator, which integrates two key components: the DMI-DWConv Module and the ReLU Linear Attention (RLA) mechanism. This feature integrator dynamically coordinates local high-frequency features from CNNs and global low-frequency features from Transformers, effectively overcoming the inherent limitations of unimodal architectures. Furthermore, a frequency-ramping structure is introduced to dynamically regulate the high-frequency/low-frequency information allocation ratio in the DMIR-Fusion integrator across different feature extraction stages through four channel scaling ratios (1/2, 1/4, 1/8, 1/16). The framework also incorporates an enhanced Asymptotic Feature Pyramid Network (AFPN) and Coordinate Attention (CA) mechanisms, synergistically optimizing spatial-semantic alignment and multi-scale feature representation, thereby significantly improving feature extraction performance. Extensive experiments on three benchmark RSI datasets (RSOD, NWPU VHR-10, and DIOR) validate the superiority of LQ-MixerNeT. Results demonstrate that our method achieves mAP@0.5 scores of 73.65%, 85.29%, and 83.25%, respectively. Ablation studies reveal that the DMIR-Fusion integrator contributes a 2.02% accuracy improvement, while the enhanced AFPN boosts performance by 0.54%. These findings highlight the model’s robustness in handling complex RSI scenarios and establish a new paradigm for multimodal fusion in remote sensing object detection.
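The record names a ReLU Linear Attention (RLA) mechanism but does not specify it. A minimal NumPy sketch of the *standard* ReLU linear-attention formulation (softmax replaced by an elementwise ReLU feature map, so attention factorizes as phi(Q)(phi(K)ᵀV) with cost linear in sequence length) illustrates why such a mechanism suits large RSIs; the function name, shapes, and epsilon below are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def relu_linear_attention(Q, K, V, eps=1e-6):
    """Linear attention with a ReLU feature map (illustrative sketch).

    Instead of softmax(Q K^T) V, which costs O(N^2 * d) for N tokens,
    compute phi(Q) @ (phi(K)^T @ V) with phi = ReLU, which costs
    O(N * d^2): the (d, d) summary phi(K)^T V is built once, then
    each query attends globally through it.
    """
    phi_q = np.maximum(Q, 0.0)              # (N, d) nonnegative query features
    phi_k = np.maximum(K, 0.0)              # (N, d) nonnegative key features
    kv = phi_k.T @ V                        # (d, d) global key-value summary
    z = phi_q @ phi_k.sum(axis=0)           # (N,) per-query normalizer
    return (phi_q @ kv) / (z[:, None] + eps)

rng = np.random.default_rng(0)
N, d = 16, 8                                # toy token count and feature width
Q, K, V = (rng.standard_normal((N, d)) for _ in range(3))
out = relu_linear_attention(Q, K, V)
print(out.shape)                            # (16, 8)
```

Because the per-query weights phi(Q)ᵢ·phi(K)ⱼ are nonnegative and normalized, each output row is (up to eps) a convex combination of value rows, mirroring softmax attention's averaging behavior while keeping global receptive field at linear cost.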
format Article
id doaj-art-a6b49422b83e41ce85583d54de290c6b
institution DOAJ
issn 2169-3536
language English
publishDate 2025-01-01
publisher IEEE
record_format Article
series IEEE Access
spelling doaj-art-a6b49422b83e41ce85583d54de290c6b2025-08-20T03:06:04ZengIEEEIEEE Access2169-35362025-01-0113913479136110.1109/ACCESS.2025.356965811002868LQ-MixerNeT: A CNN-Transformer Deep Fusion-Based Model for Object Detection in Optical Remote Sensing ImagesWenxuan Zheng0https://orcid.org/0009-0008-7351-480XYing Yang1https://orcid.org/0009-0004-7561-0329School of Physics and Electronic Information, Jiangsu Second Normal University, Nanjing, ChinaSchool of Physics and Electronic Information, Jiangsu Second Normal University, Nanjing, ChinaTo address the challenges of low detection accuracy in optical remote sensing images (RSIs) caused by densely distributed targets, extreme scale variations, and insufficient feature representation of small objects, this paper proposes LQ-MixerNeT, a novel CNN-Transformer hybrid framework with deep fusion capabilities. The core innovation lies in the DMIR-Fusion feature integrator, which integrates two key components: the DMI-DWConv Module and the ReLU Linear Attention (RLA) mechanism. This feature integrator dynamically coordinates local high-frequency features from CNNs and global low-frequency features from Transformers, effectively overcoming the inherent limitations of unimodal architectures. Furthermore, a frequency-ramping structure is introduced to dynamically regulate the high-frequency/low-frequency information allocation ratio in the DMIR-Fusion integrator across different feature extraction stages through four channel scaling ratios (1/2, 1/4, 1/8, 1/16). The framework also incorporates an enhanced Asymptotic Feature Pyramid Network (AFPN) and Coordinate Attention (CA) mechanisms, synergistically optimizing spatial-semantic alignment and multi-scale feature representation, thereby significantly improving feature extraction performance. Extensive experiments on three benchmark RSI datasets (RSOD, NWPU VHR-10, and DIOR) validate the superiority of LQ-MixerNeT.
Results demonstrate that our method achieves mAP@0.5 scores of 73.65%, 85.29%, and 83.25%, respectively. Ablation studies reveal that the DMIR-Fusion integrator contributes a 2.02% accuracy improvement, while the enhanced AFPN boosts performance by 0.54%. These findings highlight the model’s robustness in handling complex RSI scenarios and establish a new paradigm for multimodal fusion in remote sensing object detection.https://ieeexplore.ieee.org/document/11002868/Transformeroptical remote sensingobject detectionfeature fusion
spellingShingle Wenxuan Zheng
Ying Yang
LQ-MixerNeT: A CNN-Transformer Deep Fusion-Based Model for Object Detection in Optical Remote Sensing Images
IEEE Access
Transformer
optical remote sensing
object detection
feature fusion
title LQ-MixerNeT: A CNN-Transformer Deep Fusion-Based Model for Object Detection in Optical Remote Sensing Images
title_full LQ-MixerNeT: A CNN-Transformer Deep Fusion-Based Model for Object Detection in Optical Remote Sensing Images
title_fullStr LQ-MixerNeT: A CNN-Transformer Deep Fusion-Based Model for Object Detection in Optical Remote Sensing Images
title_full_unstemmed LQ-MixerNeT: A CNN-Transformer Deep Fusion-Based Model for Object Detection in Optical Remote Sensing Images
title_short LQ-MixerNeT: A CNN-Transformer Deep Fusion-Based Model for Object Detection in Optical Remote Sensing Images
title_sort lq mixernet a cnn transformer deep fusion based model for object detection in optical remote sensing images
topic Transformer
optical remote sensing
object detection
feature fusion
url https://ieeexplore.ieee.org/document/11002868/
work_keys_str_mv AT wenxuanzheng lqmixernetacnntransformerdeepfusionbasedmodelforobjectdetectioninopticalremotesensingimages
AT yingyang lqmixernetacnntransformerdeepfusionbasedmodelforobjectdetectioninopticalremotesensingimages