Transformer-Based Person Detection in Paired RGB-T Aerial Images With VTSaR Dataset
Aerial-based person detection poses a significant challenge, yet it is crucial for real-world applications such as air-ground linkage search and all-weather intelligent rescue. However, existing person detection models designed for aerial images rely heavily on numerous labeled instances and exhibit limited tolerance to the complex lighting conditions commonly encountered in search and rescue (SaR) scenarios.
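The abstract describes VTSaRNet's bimodal design: a multihead attention module encodes correlations between the visible and thermal feature streams, and channel responses are adaptively recalibrated before fusion with anchor-based detector heads. As a rough illustration of that idea only, the PyTorch sketch below shows one possible wiring of a cross-modal attention block with an SE-style channel gate; the class name, tensor shapes, and layer choices are assumptions made for this sketch, not the authors' implementation.

```python
# Illustrative sketch only: cross-modal multihead attention fusion with
# SE-style channel recalibration, in the spirit of the abstract's description.
# All names and shapes are assumptions, not the paper's released code.
import torch
import torch.nn as nn


class CrossModalFusion(nn.Module):
    def __init__(self, channels: int = 256, heads: int = 8, reduction: int = 16):
        super().__init__()
        # Multihead attention lets each modality attend to the other's tokens.
        self.attn_v2t = nn.MultiheadAttention(channels, heads, batch_first=True)
        self.attn_t2v = nn.MultiheadAttention(channels, heads, batch_first=True)
        # SE-style gate that recalibrates channel responses of the fused map.
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool1d(1),
            nn.Flatten(),
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, feat_v: torch.Tensor, feat_t: torch.Tensor) -> torch.Tensor:
        # feat_v, feat_t: (B, C, H, W) feature maps from the RGB and thermal branches.
        b, c, h, w = feat_v.shape
        tok_v = feat_v.flatten(2).transpose(1, 2)  # (B, H*W, C) token sequences
        tok_t = feat_t.flatten(2).transpose(1, 2)
        # Each modality queries the other, encoding cross-modal correlations.
        v_enh, _ = self.attn_v2t(tok_v, tok_t, tok_t)
        t_enh, _ = self.attn_t2v(tok_t, tok_v, tok_v)
        fused = (v_enh + t_enh).transpose(1, 2)    # (B, C, H*W)
        scale = self.gate(fused).unsqueeze(-1)     # (B, C, 1) channel weights
        fused = fused * scale                      # recalibrate channel responses
        return fused.reshape(b, c, h, w)           # back to a map for the detector head


if __name__ == "__main__":
    # Toy usage: fuse one level of paired RGB-T backbone features.
    rgb = torch.randn(2, 256, 20, 20)
    thermal = torch.randn(2, 256, 20, 20)
    print(CrossModalFusion()(rgb, thermal).shape)  # torch.Size([2, 256, 20, 20])
```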
Main Authors: | Xiangqing Zhang; Yan Feng; Nan Wang; Guohua Lu; Shaohui Mei |
---|---|
Format: | Article |
Language: | English |
Published: | IEEE, 2025-01-01 |
Series: | IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing |
Subjects: | Aerial-based person detection; bimodality transformer; instance segmentation for copy–paste (ISCP); VTSaR dataset |
Online Access: | https://ieeexplore.ieee.org/document/10833840/ |
author | Xiangqing Zhang; Yan Feng; Nan Wang; Guohua Lu; Shaohui Mei |
collection | DOAJ |
description | Aerial-based person detection poses a significant challenge, yet it is crucial for real-world applications such as air-ground linkage search and all-weather intelligent rescue. However, existing person detection models designed for aerial images rely heavily on numerous labeled instances and exhibit limited tolerance to the complex lighting conditions commonly encountered in search and rescue (SaR) scenarios. This article presents a visible-thermal person detection network for SaR scenarios (VTSaRNet) to address the challenge of detecting sparsely distributed persons in SaR scenes marked by intricate illumination conditions and restricted accessibility. VTSaRNet integrates an instance segmentation for copy–paste (ISCP) mechanism with a union transformer network that operates on both the visible (V) and thermal (T) modalities. Specifically, synthetic samples are generated through offline Mosaic augmentation by oversampling local areas of bulk images, and the ISCP module is then used to extract accurate boundaries of person instances from complex backgrounds. VTSaRNet cross-integrates the global features and encodes the correlations between the two modalities through a multihead attention module. It also adaptively recalibrates the channel responses of partial feature maps for fusion with the transformer module, in conjunction with anchor-based detectors. Moreover, an adaptation scheme with multiple strategies is constructed to handle the various scenarios in which persons appear, and the entire network is trained end-to-end. Extensive experiments on the Heridal and VTSaR datasets demonstrate the effectiveness of the lightweight VTSaRNet, which achieves a precision of 98.3%, a recall of 96.78%, an mAP@0.5 of 98.73%, and an mAP@0.5:0.95 of 73.98% on the self-built VTSaR dataset. This performance sets a new benchmark for person detection from aerial imagery. |
format | Article |
id | doaj-art-8ca09ab0048c44d48bda9b81d1fae311 |
institution | Kabale University |
issn | 1939-1404 2151-1535 |
language | English |
publishDate | 2025-01-01 |
publisher | IEEE |
record_format | Article |
series | IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing |
doi | 10.1109/JSTARS.2025.3526995 |
article_number | 10833840 |
container_volume | 18 |
container_pages | 5082-5099 |
author_affiliations | Xiangqing Zhang (ORCID: 0000-0001-7273-6170), School of Electronics and Information, Northwestern Polytechnical University, Xi'an, China; Yan Feng (ORCID: 0000-0002-0669-9970), School of Electronics and Information, Northwestern Polytechnical University, Xi'an, China; Nan Wang (ORCID: 0000-0001-8739-6711), School of Electronics and Information, Northwestern Polytechnical University, Xi'an, China; Guohua Lu (ORCID: 0000-0001-6421-0232), Department of Military Biomedical Engineering, Air Force Medical University, Xi'an, China; Shaohui Mei (ORCID: 0000-0002-8018-596X), School of Electronics and Information, Northwestern Polytechnical University, Xi'an, China |
title | Transformer-Based Person Detection in Paired RGB-T Aerial Images With VTSaR Dataset |
topic | Aerial-based person detection; bimodality transformer; instance segmentation for copy–paste (ISCP); VTSaR dataset |
url | https://ieeexplore.ieee.org/document/10833840/ |