Transformer-Based Person Detection in Paired RGB-T Aerial Images With VTSaR Dataset
Aerial-based person detection poses a significant challenge, yet it is crucial for real-world applications such as air-ground linkage search and all-weather intelligent rescue. However, existing person detection models designed for aerial images rely heavily on numerous labeled instances and exhibit limited tolerance to the complex lighting conditions commonly encountered in search and rescue (SaR) scenarios.
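The abstract describes VTSaRNet's bimodal design: a multihead attention module encodes correlations between the visible and thermal feature streams, and channel responses are adaptively recalibrated before fusion with anchor-based detector heads. As a rough illustration of that idea only, the PyTorch sketch below shows one possible wiring of a cross-modal attention block with an SE-style channel gate; the class name, tensor shapes, and layer choices are assumptions made for this sketch, not the authors' implementation.

```python
# Illustrative sketch only: cross-modal multihead attention fusion with
# SE-style channel recalibration, in the spirit of the abstract's description.
# All names and shapes are assumptions, not the paper's released code.
import torch
import torch.nn as nn


class CrossModalFusion(nn.Module):
    def __init__(self, channels: int = 256, heads: int = 8, reduction: int = 16):
        super().__init__()
        # Multihead attention lets each modality attend to the other's tokens.
        self.attn_v2t = nn.MultiheadAttention(channels, heads, batch_first=True)
        self.attn_t2v = nn.MultiheadAttention(channels, heads, batch_first=True)
        # SE-style gate that recalibrates channel responses of the fused map.
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool1d(1),
            nn.Flatten(),
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, feat_v: torch.Tensor, feat_t: torch.Tensor) -> torch.Tensor:
        # feat_v, feat_t: (B, C, H, W) feature maps from the RGB and thermal branches.
        b, c, h, w = feat_v.shape
        tok_v = feat_v.flatten(2).transpose(1, 2)  # (B, H*W, C) token sequences
        tok_t = feat_t.flatten(2).transpose(1, 2)
        # Each modality queries the other, encoding cross-modal correlations.
        v_enh, _ = self.attn_v2t(tok_v, tok_t, tok_t)
        t_enh, _ = self.attn_t2v(tok_t, tok_v, tok_v)
        fused = (v_enh + t_enh).transpose(1, 2)    # (B, C, H*W)
        scale = self.gate(fused).unsqueeze(-1)     # (B, C, 1) channel weights
        fused = fused * scale                      # recalibrate channel responses
        return fused.reshape(b, c, h, w)           # back to a map for the detector head


if __name__ == "__main__":
    # Toy usage: fuse one level of paired RGB-T backbone features.
    rgb = torch.randn(2, 256, 20, 20)
    thermal = torch.randn(2, 256, 20, 20)
    print(CrossModalFusion()(rgb, thermal).shape)  # torch.Size([2, 256, 20, 20])
```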
Main Authors: | Xiangqing Zhang; Yan Feng; Nan Wang; Guohua Lu; Shaohui Mei |
---|---|
Format: | Article |
Language: | English |
Published: | IEEE, 2025-01-01 |
Series: | IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing |
Subjects: | Aerial-based person detection; bimodality transformer; instance segmentation for copy–paste (ISCP); VTSaR dataset |
Online Access: | https://ieeexplore.ieee.org/document/10833840/ |
author | Xiangqing Zhang; Yan Feng; Nan Wang; Guohua Lu; Shaohui Mei |
collection | DOAJ |
description | Aerial-based person detection poses a significant challenge, yet it is crucial for real-world applications such as air-ground linkage search and all-weather intelligent rescue. However, existing person detection models designed for aerial images rely heavily on numerous labeled instances and exhibit limited tolerance to the complex lighting conditions commonly encountered in search and rescue (SaR) scenarios. This article presents a visible-thermal person detection network for SaR scenarios (VTSaRNet) to address the challenge of detecting sparsely distributed persons in SaR scenes marked by intricate illumination conditions and restricted accessibility. VTSaRNet integrates an instance segmentation for copy–paste (ISCP) mechanism with a union transformer network that operates on both the visible (V) and thermal (T) modalities. Specifically, synthetic samples are generated through offline Mosaic augmentation by oversampling local areas of bulk images, and the ISCP module is then used to extract accurate boundaries of person instances from complex backgrounds. VTSaRNet cross-integrates the global features and encodes the correlations between the two modalities through a multihead attention module. It also adaptively recalibrates the channel responses of partial feature maps for fusion with the transformer module, in conjunction with anchor-based detectors. Moreover, an adaptation scheme with multiple strategies is constructed to handle the various scenarios in which persons appear, and the entire network is trained end-to-end. Extensive experiments on the Heridal and VTSaR datasets demonstrate the effectiveness of the lightweight VTSaRNet, which achieves a precision of 98.3%, a recall of 96.78%, an mAP@0.5 of 98.73%, and an mAP@0.5:0.95 of 73.98% on the self-built VTSaR dataset. This performance sets a new benchmark for person detection from aerial imagery. |
format | Article |
id | doaj-art-8ca09ab0048c44d48bda9b81d1fae311 |
institution | Kabale University |
issn | 1939-1404 2151-1535 |
language | English |
publishDate | 2025-01-01 |
publisher | IEEE |
record_format | Article |
series | IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing |
doi | 10.1109/JSTARS.2025.3526995 |
article_number | 10833840 |
container_volume | 18 |
container_pages | 5082-5099 |
author_affiliations | Xiangqing Zhang (ORCID: 0000-0001-7273-6170), School of Electronics and Information, Northwestern Polytechnical University, Xi'an, China; Yan Feng (ORCID: 0000-0002-0669-9970), School of Electronics and Information, Northwestern Polytechnical University, Xi'an, China; Nan Wang (ORCID: 0000-0001-8739-6711), School of Electronics and Information, Northwestern Polytechnical University, Xi'an, China; Guohua Lu (ORCID: 0000-0001-6421-0232), Department of Military Biomedical Engineering, Air Force Medical University, Xi'an, China; Shaohui Mei (ORCID: 0000-0002-8018-596X), School of Electronics and Information, Northwestern Polytechnical University, Xi'an, China |
title | Transformer-Based Person Detection in Paired RGB-T Aerial Images With VTSaR Dataset |
topic | Aerial-based person detection; bimodality transformer; instance segmentation for copy–paste (ISCP); VTSaR dataset |
url | https://ieeexplore.ieee.org/document/10833840/ |