Transformer-Based Person Detection in Paired RGB-T Aerial Images With VTSaR Dataset

Bibliographic Details
Main Authors: Xiangqing Zhang, Yan Feng, Nan Wang, Guohua Lu, Shaohui Mei
Format: Article
Language: English
Published: IEEE 2025-01-01
Series: IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing
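Subjects: Aerial-based person detection; bimodality transformer; instance segmentation for copy–paste (ISCP); VTSaR dataset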
Online Access: https://ieeexplore.ieee.org/document/10833840/
_version_ 1823857144076697600
author Xiangqing Zhang
Yan Feng
Nan Wang
Guohua Lu
Shaohui Mei
collection DOAJ
description Aerial-based person detection poses a significant challenge, yet it is crucial for real-world applications such as air-ground linkage search and all-weather intelligent corescuing. However, existing person detection models designed for aerial images rely heavily on large numbers of labeled instances and exhibit limited tolerance of the complex lighting conditions commonly encountered in search and rescue (SaR) scenarios. This article presents the visible-thermal from SaR scenarios for person detection network (VTSaRNet) to address the challenge of detecting persons situated sparsely in SaR scenes marked by intricate illumination conditions and restricted accessibility. VTSaRNet integrates the instance segmentation for copy–paste mechanism (ISCP) using a Union Transformer Network that operates on both the Visible (V) and Thermal (T) modalities. Specifically, this study employs synthetic samples obtained through offline Mosaic augmentation by oversampling the local area of bulk images. Then, it utilizes the ISCP module to extract accurate boundaries of person instances from complex backgrounds. VTSaRNet cross-integrates the global features and encodes the correlations between the two modalities through the multihead attention module. It also adaptively recalibrates the channel responses of partial feature maps for fusion operations with the transformer module in conjunction with anchor-based detectors. Moreover, the adaptation scheme is constructed with multiple strategies to effectively handle various scenarios involving persons, and the entire network is trained end-to-end. Extensive experiments conducted on the Heridal and VTSaR datasets demonstrate the effectiveness of the lightweight VTSaRNet, which achieves a precision of 98.3%, recall of 96.78%, mAP@0.5 of 98.73%, and mAP@0.5:0.95 of 73.98% on the self-built VTSaR dataset. This performance sets a new benchmark in person detection from aerial imagery.
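The copy–paste idea mentioned in the description can be illustrated with a minimal sketch. This is not the authors' ISCP module: it only assumes that a binary instance mask is already available and that the visible and thermal images are pixel-aligned, so the same mask and offset can be applied to both. The function names `paste_instance` and `paste_paired` and the toy 4x4 images are hypothetical.

```python
from copy import deepcopy


def paste_instance(target, source, mask, top, left):
    """Return a copy of `target` with the masked pixels of `source` pasted at (top, left)."""
    out = deepcopy(target)
    for r, mask_row in enumerate(mask):
        for c, keep in enumerate(mask_row):
            rr, cc = top + r, left + c
            # Skip background pixels and anything that falls outside the target.
            if keep and 0 <= rr < len(out) and 0 <= cc < len(out[0]):
                out[rr][cc] = source[r][c]
    return out


def paste_paired(target_v, target_t, patch_v, patch_t, mask, top, left):
    """Paste one instance into both modalities using the same mask and offset (assumed aligned)."""
    return (
        paste_instance(target_v, patch_v, mask, top, left),
        paste_instance(target_t, patch_t, mask, top, left),
    )


# Toy single-channel images: a visible and a thermal background of size 4x4.
background_v = [[0] * 4 for _ in range(4)]
background_t = [[10] * 4 for _ in range(4)]

# A 2x2 person patch in each modality and its (hypothetical) segmentation mask.
patch_v = [[7, 7], [7, 7]]
patch_t = [[99, 99], [99, 99]]
mask = [[1, 0], [1, 1]]

aug_v, aug_t = paste_paired(background_v, background_t, patch_v, patch_t, mask, top=1, left=2)
```

Using one mask and one offset for both images keeps the pasted person at the same location in the visible and thermal views, which is the property a paired RGB-T detector would need from such synthetic samples.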
format Article
id doaj-art-8ca09ab0048c44d48bda9b81d1fae311
institution Kabale University
issn 1939-1404
2151-1535
language English
publishDate 2025-01-01
publisher IEEE
record_format Article
series IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing
spelling doaj-art-8ca09ab0048c44d48bda9b81d1fae311
2025-02-12T00:00:49Z
eng
IEEE
IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing
ISSN 1939-1404, 2151-1535
2025-01-01
Volume 18, pages 5082-5099
DOI 10.1109/JSTARS.2025.3526995
IEEE Xplore document 10833840
Transformer-Based Person Detection in Paired RGB-T Aerial Images With VTSaR Dataset
Xiangqing Zhang (https://orcid.org/0000-0001-7273-6170), School of Electronics and Information, Northwestern Polytechnical University, Xi'an, China
Yan Feng (https://orcid.org/0000-0002-0669-9970), School of Electronics and Information, Northwestern Polytechnical University, Xi'an, China
Nan Wang (https://orcid.org/0000-0001-8739-6711), School of Electronics and Information, Northwestern Polytechnical University, Xi'an, China
Guohua Lu (https://orcid.org/0000-0001-6421-0232), Department of Military Biomedical Engineering, Air Force Medical University, Xi'an, China
Shaohui Mei (https://orcid.org/0000-0002-8018-596X), School of Electronics and Information, Northwestern Polytechnical University, Xi'an, China
https://ieeexplore.ieee.org/document/10833840/
Aerial-based person detection
bimodality transformer
instance segmentation for copy–paste (ISCP)
VTSaR dataset
title Transformer-Based Person Detection in Paired RGB-T Aerial Images With VTSaR Dataset
topic Aerial-based person detection
bimodality transformer
instance segmentation for copy–paste (ISCP)
VTSaR dataset
url https://ieeexplore.ieee.org/document/10833840/