A lightweight mechanism for vision-transformer-based object detection
Abstract DETR (DEtection TRansformer) is a computer-vision model for object detection that replaces traditional, complex detection methods with a Transformer architecture and has achieved significant improvements over previous methods. However, DETR's attention-based detection framework exhibits limitations in small and medium-sized object detection: it struggles to extract fine-grained details of small and medium-sized objects from low-resolution features, and its computational complexity increases significantly with input scale, constraining real-time detection efficiency. To address these limitations, we introduce the Cross Feature Attention (XFA) mechanism and propose XFCOS (XFA-based with FCOS), a novel object detection model built upon it. XFA simplifies the attention computation through L2 normalization and two one-dimensional convolutions applied in different directions, reducing the computational complexity from quadratic to linear while preserving spatial context awareness. XFCOS enhances the original TSP-FCOS (Transformer-based Set Prediction with FCOS) model by integrating XFA into the transformer encoder, creating a CNN-ViT hybrid architecture that significantly reduces computational cost without sacrificing accuracy. Extensive experiments demonstrate that XFCOS achieves state-of-the-art performance while addressing DETR's convergence and efficiency limitations. On Pascal VOC 2007, XFCOS attains 54.7 AP and 60.7 AP$_{75}$, surpassing DETR by 4.5 AP and 7.9 AP$_{75}$ respectively, establishing new benchmarks among ResNet-50-based detectors. The model is particularly strong on small objects, achieving 24.0 AP$_{S}$ and 43.9 AP$_{M}$ on COCO 2017, improvements of 3.3 AP$_{S}$ and 3.8 AP$_{M}$ over DETR. Through computational optimization, XFCOS reduces encoder FLOPs to 13.5G, a 17.2% decrease from TSP-FCOS's 16.3G, and cuts activation memory from 285.78M to 264.64M, a 7.4% reduction. This significantly enhances computational efficiency.
| Main Authors: | Yanming Ye, Qiang Sun, Kailong Cheng, Xingfa Shen, Dongjing Wang |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | Springer 2025-05-01 |
| Series: | Complex & Intelligent Systems |
| Subjects: | Object detection, DETR, XFA, XFCOS, CNN-ViT |
| Online Access: | https://doi.org/10.1007/s40747-025-01904-x |
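The abstract describes XFA as replacing quadratic-cost attention with a linear-cost variant built from L2 normalization and two one-dimensional convolutions applied in different directions. The paper's exact formulation is not reproduced in this record; the following is a minimal NumPy sketch of one plausible reading, in which attention is computed over the d×d channel dimension (cost O(N·d²), linear in the token count N) rather than the N×N token dimension, with Q and K L2-normalized along the token axis, and two 3-tap filters (one along each spatial axis) standing in for the directional convolutions. The function names and the averaging kernel are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def softmax(z, axis=-1):
    # numerically stable softmax
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def l2norm(x, axis, eps=1e-6):
    # L2-normalize along the given axis
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def xfa_block(x, H, W, wq, wk, wv):
    """Hypothetical linear-complexity attention sketch.
    x: (H*W, d) flattened feature map; wq/wk/wv: (d, d) projections."""
    q, k, v = x @ wq, x @ wk, x @ wv
    q = l2norm(q, axis=0)            # normalize Q along the token axis
    k = l2norm(k, axis=0)            # normalize K along the token axis
    attn = softmax(q.T @ k, -1)      # (d, d) channel attention: O(N d^2)
    y = v @ attn                     # (N, d) attended features
    # two directional 1-D convolutions (simple 3-tap smoothing here),
    # one along H and one along W, to retain spatial context
    fmap = y.reshape(H, W, -1)
    kern = np.array([0.25, 0.5, 0.25])
    conv = lambda s: np.convolve(s, kern, mode="same")
    fmap = np.apply_along_axis(conv, 0, fmap)   # along H
    fmap = np.apply_along_axis(conv, 1, fmap)   # along W
    return fmap.reshape(H * W, -1)
```

Because the d×d attention matrix does not grow with the number of tokens, doubling the input resolution doubles (rather than quadruples) the attention cost, which matches the quadratic-to-linear claim in the abstract.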
| _version_ | 1850113076583464960 |
|---|---|
| author | Yanming Ye Qiang Sun Kailong Cheng Xingfa Shen Dongjing Wang |
| author_facet | Yanming Ye Qiang Sun Kailong Cheng Xingfa Shen Dongjing Wang |
| author_sort | Yanming Ye |
| collection | DOAJ |
| description | Abstract DETR (DEtection TRansformer) is a computer-vision model for object detection that replaces traditional, complex detection methods with a Transformer architecture and has achieved significant improvements over previous methods. However, DETR's attention-based detection framework exhibits limitations in small and medium-sized object detection: it struggles to extract fine-grained details of small and medium-sized objects from low-resolution features, and its computational complexity increases significantly with input scale, constraining real-time detection efficiency. To address these limitations, we introduce the Cross Feature Attention (XFA) mechanism and propose XFCOS (XFA-based with FCOS), a novel object detection model built upon it. XFA simplifies the attention computation through L2 normalization and two one-dimensional convolutions applied in different directions, reducing the computational complexity from quadratic to linear while preserving spatial context awareness. XFCOS enhances the original TSP-FCOS (Transformer-based Set Prediction with FCOS) model by integrating XFA into the transformer encoder, creating a CNN-ViT hybrid architecture that significantly reduces computational cost without sacrificing accuracy. Extensive experiments demonstrate that XFCOS achieves state-of-the-art performance while addressing DETR's convergence and efficiency limitations. On Pascal VOC 2007, XFCOS attains 54.7 AP and 60.7 AP$_{75}$, surpassing DETR by 4.5 AP and 7.9 AP$_{75}$ respectively, establishing new benchmarks among ResNet-50-based detectors. The model is particularly strong on small objects, achieving 24.0 AP$_{S}$ and 43.9 AP$_{M}$ on COCO 2017, improvements of 3.3 AP$_{S}$ and 3.8 AP$_{M}$ over DETR. Through computational optimization, XFCOS reduces encoder FLOPs to 13.5G, a 17.2% decrease from TSP-FCOS's 16.3G, and cuts activation memory from 285.78M to 264.64M, a 7.4% reduction. This significantly enhances computational efficiency. |
| format | Article |
| id | doaj-art-0d256830a6dd43c0a7ef04893be5f3fd |
| institution | OA Journals |
| issn | 2199-4536 2198-6053 |
| language | English |
| publishDate | 2025-05-01 |
| publisher | Springer |
| record_format | Article |
| series | Complex & Intelligent Systems |
| spelling | doaj-art-0d256830a6dd43c0a7ef04893be5f3fd2025-08-20T02:37:14ZengSpringerComplex & Intelligent Systems2199-45362198-60532025-05-0111711210.1007/s40747-025-01904-xA lightweight mechanism for vision-transformer-based object detectionYanming Ye0Qiang Sun1Kailong Cheng2Xingfa Shen3Dongjing Wang4School of Information Engineering, Hangzhou Dianzi UniversitySchool of Information Engineering, Hangzhou Dianzi UniversitySchool of Computer Science, Hangzhou Dianzi UniversitySchool of Computer Science, Hangzhou Dianzi UniversitySchool of Computer Science, Hangzhou Dianzi Universityhttps://doi.org/10.1007/s40747-025-01904-xObject detectionDETRXFAXFCOSCNN-ViT |
| spellingShingle | Yanming Ye Qiang Sun Kailong Cheng Xingfa Shen Dongjing Wang A lightweight mechanism for vision-transformer-based object detection Complex & Intelligent Systems Object detection DETR XFA XFCOS CNN-ViT |
| title | A lightweight mechanism for vision-transformer-based object detection |
| title_full | A lightweight mechanism for vision-transformer-based object detection |
| title_fullStr | A lightweight mechanism for vision-transformer-based object detection |
| title_full_unstemmed | A lightweight mechanism for vision-transformer-based object detection |
| title_short | A lightweight mechanism for vision-transformer-based object detection |
| title_sort | lightweight mechanism for vision transformer based object detection |
| topic | Object detection DETR XFA XFCOS CNN-ViT |
| url | https://doi.org/10.1007/s40747-025-01904-x |
| work_keys_str_mv | AT yanmingye alightweightmechanismforvisiontransformerbasedobjectdetection AT qiangsun alightweightmechanismforvisiontransformerbasedobjectdetection AT kailongcheng alightweightmechanismforvisiontransformerbasedobjectdetection AT xingfashen alightweightmechanismforvisiontransformerbasedobjectdetection AT dongjingwang alightweightmechanismforvisiontransformerbasedobjectdetection AT yanmingye lightweightmechanismforvisiontransformerbasedobjectdetection AT qiangsun lightweightmechanismforvisiontransformerbasedobjectdetection AT kailongcheng lightweightmechanismforvisiontransformerbasedobjectdetection AT xingfashen lightweightmechanismforvisiontransformerbasedobjectdetection AT dongjingwang lightweightmechanismforvisiontransformerbasedobjectdetection |