RDCRNet: RGB-T Object Detection Network Based on Cross-Modal Representation Model
RGB-thermal object detection harnesses complementary information from visible and thermal modalities to enhance detection robustness in challenging environments, particularly under low-light conditions. However, existing approaches are limited by their heavy dependence on precisely registered data and by insufficient handling of cross-modal distribution disparities. This paper presents RDCRNet, a novel framework incorporating a Cross-Modal Representation Model to address these challenges. The proposed network features a Cross-Modal Feature Remapping Module that aligns modality distributions through statistical normalization and learnable correction parameters, significantly reducing feature discrepancies between modalities. A Cross-Modal Refinement and Interaction Module enables bidirectional information exchange via trinity refinement for intra-modal context modeling and cross-attention mechanisms for unaligned feature fusion. Multiscale detection capability is enhanced through a Cross-Scale Feature Integration Module, improving detection performance across various object sizes. To overcome the inherent data scarcity in RGB-T detection, we introduce a self-supervised pretraining strategy that combines masked reconstruction with adversarial learning and a semantic consistency loss, effectively leveraging both aligned and unaligned RGB-T samples. Extensive experiments demonstrate that RDCRNet achieves state-of-the-art performance on multiple benchmark datasets while maintaining high computational and storage efficiency, validating its practical effectiveness in real-world applications.
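The Cross-Modal Feature Remapping Module is described as aligning modality distributions through statistical normalization plus learnable correction parameters. As a loose illustration only, the statistics-matching idea can be sketched in numpy; `remap_features` and the scalar `gamma`/`beta` parameters are hypothetical stand-ins for the paper's learned corrections, not its actual implementation:

```python
import numpy as np

def remap_features(feat_src, feat_ref, gamma=1.0, beta=0.0):
    """Shift feat_src's per-channel statistics onto feat_ref's.

    feat_src, feat_ref: feature maps of shape (C, H, W).
    gamma, beta: stand-ins for learnable correction parameters
    (plain scalars here; a real module would learn them per channel).
    """
    mu_s = feat_src.mean(axis=(1, 2), keepdims=True)
    std_s = feat_src.std(axis=(1, 2), keepdims=True) + 1e-6  # avoid div-by-zero
    mu_r = feat_ref.mean(axis=(1, 2), keepdims=True)
    std_r = feat_ref.std(axis=(1, 2), keepdims=True)
    normalized = (feat_src - mu_s) / std_s  # zero mean, unit variance per channel
    return gamma * (normalized * std_r + mu_r) + beta

# Example: pull a thermal feature map onto RGB channel statistics.
rng = np.random.default_rng(0)
thermal = rng.normal(5.0, 3.0, size=(8, 16, 16))  # 8 channels, offset stats
rgb = rng.normal(0.0, 1.0, size=(8, 16, 16))
aligned = remap_features(thermal, rgb)
```

With `gamma=1, beta=0` this is plain per-channel statistics matching; the learnable corrections would let the network deviate from exact matching where that helps detection.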
| Main Authors: | Yubin Li, Weida Zhan, Yichun Jiang, Jinxin Guo |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | MDPI AG, 2025-04-01 |
| Series: | Entropy |
| Subjects: | object detection; multimodal; cross-modal representation; pretraining |
| Online Access: | https://www.mdpi.com/1099-4300/27/4/442 |
| author | Yubin Li; Weida Zhan; Yichun Jiang; Jinxin Guo |
|---|---|
| author_sort | Yubin Li |
| affiliation | The College of Electronic and Information Engineering, Changchun University of Science and Technology, Changchun 130022, China (all authors) |
| collection | DOAJ |
| format | Article |
| id | doaj-art-165daba04a964a438f6fd04063dfe2a4 |
| issn | 1099-4300 |
| doi | 10.3390/e27040442 |
| series | Entropy |
| citation | Entropy 27(4), 442 (2025) |
| publisher | MDPI AG |
| publishDate | 2025-04-01 |
| language | English |
| title | RDCRNet: RGB-T Object Detection Network Based on Cross-Modal Representation Model |
| topic | object detection; multimodal; cross-modal representation; pretraining |
| description | RGB-thermal object detection harnesses complementary information from visible and thermal modalities to enhance detection robustness in challenging environments, particularly under low-light conditions. However, existing approaches are limited by their heavy dependence on precisely registered data and by insufficient handling of cross-modal distribution disparities. This paper presents RDCRNet, a novel framework incorporating a Cross-Modal Representation Model to address these challenges. The proposed network features a Cross-Modal Feature Remapping Module that aligns modality distributions through statistical normalization and learnable correction parameters, significantly reducing feature discrepancies between modalities. A Cross-Modal Refinement and Interaction Module enables bidirectional information exchange via trinity refinement for intra-modal context modeling and cross-attention mechanisms for unaligned feature fusion. Multiscale detection capability is enhanced through a Cross-Scale Feature Integration Module, improving detection performance across various object sizes. To overcome the inherent data scarcity in RGB-T detection, we introduce a self-supervised pretraining strategy that combines masked reconstruction with adversarial learning and a semantic consistency loss, effectively leveraging both aligned and unaligned RGB-T samples. Extensive experiments demonstrate that RDCRNet achieves state-of-the-art performance on multiple benchmark datasets while maintaining high computational and storage efficiency, validating its practical effectiveness in real-world applications. |
| url | https://www.mdpi.com/1099-4300/27/4/442 |