A lightweight mechanism for vision-transformer-based object detection


Bibliographic Details
Main Authors: Yanming Ye, Qiang Sun, Kailong Cheng, Xingfa Shen, Dongjing Wang
Format: Article
Language: English
Published: Springer, 2025-05-01
Series: Complex & Intelligent Systems
Subjects: Object detection, DETR, XFA, XFCOS, CNN-ViT
Online Access: https://doi.org/10.1007/s40747-025-01904-x
_version_ 1850113076583464960
author Yanming Ye
Qiang Sun
Kailong Cheng
Xingfa Shen
Dongjing Wang
author_facet Yanming Ye
Qiang Sun
Kailong Cheng
Xingfa Shen
Dongjing Wang
author_sort Yanming Ye
collection DOAJ
description Abstract DETR (DEtection TRansformer) is a computer vision model for object detection that replaces traditional, complex detection pipelines with a Transformer architecture and has achieved significant improvements over previous methods. However, DETR's attention-based detection framework exhibits limitations in small and medium-sized object detection: it struggles to extract fine-grained details of small and medium-sized objects from low-resolution features, and its computational complexity grows quadratically with the input scale, which constrains real-time detection efficiency. To address these limitations, we introduce the Cross Feature Attention (XFA) mechanism and propose XFCOS (XFA-based with FCOS), a novel object detection model built upon it. XFA simplifies the attention computation through L2 normalization and two one-dimensional convolutions applied in different directions, reducing the computational complexity from quadratic to linear while preserving spatial context awareness. XFCOS enhances the original TSP-FCOS (Transformer-based Set Prediction with FCOS) model by integrating XFA into the transformer encoder, creating a CNN-ViT hybrid architecture that significantly reduces computational costs without sacrificing accuracy. Extensive experiments demonstrate that XFCOS achieves state-of-the-art performance while addressing DETR's convergence and efficiency limitations. On Pascal VOC 2007, XFCOS attains 54.7 AP and 60.7 AP$_{75}$, surpassing DETR by 4.5 AP and 7.9 AP$_{75}$ respectively and establishing new benchmarks among ResNet-50-based detectors. The model shows particular strength in small object detection, achieving 24.0 AP$_\mathrm{S}$ and 43.9 AP$_\mathrm{M}$ on COCO 2017, improvements of 3.3 AP$_\mathrm{S}$ and 3.8 AP$_\mathrm{M}$ over DETR.
Through computational optimization, XFCOS reduces encoder FLOPs to 13.5G, a 17.2% decrease from TSP-FCOS's 16.3G, while cutting activation memory from 285.78M to 264.64M, a reduction of 7.4%. This significantly enhances computational efficiency.
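The abstract names L2 normalization as the ingredient that lets XFA's attention drop from quadratic to linear complexity, but does not give the exact formulation. The following is a minimal NumPy sketch of the general reassociation trick behind such linear-attention designs — normalize queries and keys, drop the softmax, and regroup the matrix product — not the paper's actual implementation; the function names, shapes, and the omitted directional 1-D convolutions are assumptions.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-6):
    """Scale each row of x to (approximately) unit L2 norm."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def linear_attention(q, k, v):
    # With q and k L2-normalized and the softmax removed, the product
    # (q @ k.T) @ v can be reassociated as q @ (k.T @ v): the n x n
    # attention map is never materialized, so cost drops from
    # O(n^2 * d) to O(n * d^2) in the sequence length n.
    q, k = l2_normalize(q), l2_normalize(k)
    return q @ (k.T @ v)

n, d = 1024, 64  # sequence length and feature dim (assumed sizes)
rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((n, d)) for _ in range(3))

out = linear_attention(q, k, v)

# Same result as the quadratic-order grouping, up to float round-off.
quadratic = (l2_normalize(q) @ l2_normalize(k).T) @ v
assert np.allclose(out, quadratic)
```

Because matrix multiplication is associative, both groupings compute the same tensor; only the evaluation order — and therefore the memory and FLOP cost — differs.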
format Article
id doaj-art-0d256830a6dd43c0a7ef04893be5f3fd
institution OA Journals
issn 2199-4536
2198-6053
language English
publishDate 2025-05-01
publisher Springer
record_format Article
series Complex & Intelligent Systems
spelling doaj-art-0d256830a6dd43c0a7ef04893be5f3fd 2025-08-20T02:37:14Z eng Springer Complex & Intelligent Systems 2199-4536 2198-6053 2025-05-01 117112 10.1007/s40747-025-01904-x
A lightweight mechanism for vision-transformer-based object detection
Yanming Ye (School of Information Engineering, Hangzhou Dianzi University)
Qiang Sun (School of Information Engineering, Hangzhou Dianzi University)
Kailong Cheng (School of Computer Science, Hangzhou Dianzi University)
Xingfa Shen (School of Computer Science, Hangzhou Dianzi University)
Dongjing Wang (School of Computer Science, Hangzhou Dianzi University)
https://doi.org/10.1007/s40747-025-01904-x
Object detection
DETR
XFA
XFCOS
CNN-ViT
spellingShingle Yanming Ye
Qiang Sun
Kailong Cheng
Xingfa Shen
Dongjing Wang
A lightweight mechanism for vision-transformer-based object detection
Complex & Intelligent Systems
Object detection
DETR
XFA
XFCOS
CNN-ViT
title A lightweight mechanism for vision-transformer-based object detection
title_full A lightweight mechanism for vision-transformer-based object detection
title_fullStr A lightweight mechanism for vision-transformer-based object detection
title_full_unstemmed A lightweight mechanism for vision-transformer-based object detection
title_short A lightweight mechanism for vision-transformer-based object detection
title_sort lightweight mechanism for vision transformer based object detection
topic Object detection
DETR
XFA
XFCOS
CNN-ViT
url https://doi.org/10.1007/s40747-025-01904-x
work_keys_str_mv AT yanmingye alightweightmechanismforvisiontransformerbasedobjectdetection
AT qiangsun alightweightmechanismforvisiontransformerbasedobjectdetection
AT kailongcheng alightweightmechanismforvisiontransformerbasedobjectdetection
AT xingfashen alightweightmechanismforvisiontransformerbasedobjectdetection
AT dongjingwang alightweightmechanismforvisiontransformerbasedobjectdetection
AT yanmingye lightweightmechanismforvisiontransformerbasedobjectdetection
AT qiangsun lightweightmechanismforvisiontransformerbasedobjectdetection
AT kailongcheng lightweightmechanismforvisiontransformerbasedobjectdetection
AT xingfashen lightweightmechanismforvisiontransformerbasedobjectdetection
AT dongjingwang lightweightmechanismforvisiontransformerbasedobjectdetection