Global Feature Focusing and Information Enhancement Network for Occluded Pedestrian Detection
Objective: Pedestrian detection is a crucial task in computer vision, especially in applications like autonomous driving, robot navigation, and intelligent surveillance. However, pedestrian occlusion in real-world scenarios remains a significant challenge. Occlusion leads to a sharp reduction in the v...
| Main Authors: | ZHENG Kaikui, JI Kangyou, LI Jun, LI Qiming |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | Editorial Department of Journal of Sichuan University (Engineering Science Edition), 2025-01-01 |
| Series: | 工程科学与技术 (Advanced Engineering Sciences) |
| Subjects: | |
| Online Access: | http://jsuese.scu.edu.cn/thesisDetails#10.12454/j.jsuese.202401025 |
| _version_ | 1849387915341201408 |
|---|---|
| author | ZHENG Kaikui JI Kangyou LI Jun LI Qiming |
| author_facet | ZHENG Kaikui JI Kangyou LI Jun LI Qiming |
| author_sort | ZHENG Kaikui |
| collection | DOAJ |
| description | Objective: Pedestrian detection is a crucial task in computer vision, especially in applications such as autonomous driving, robot navigation, and intelligent surveillance. However, pedestrian occlusion in real-world scenarios remains a significant challenge: occlusion sharply reduces the visible extent of targets and causes a substantial loss of pedestrian features, making it difficult for detectors to distinguish pedestrian targets from the background effectively. Existing methods, including post-processing optimization, specific-model-based improvements, and body-part-feature-based methods, have limitations such as inaccurate handling of heavily occluded positive samples, high computational complexity, and susceptibility to background noise. Therefore, a more effective method for occluded pedestrian detection is essential for enhancing the performance of pedestrian detectors. Methods: We propose the Global Feature Focusing and Information Enhancement Network (GFFIE-Net). The method starts with feature extraction, employing HRNet-W32 as the backbone network to generate multi-scale feature maps at different resolutions (1/4, 1/8, 1/16, and 1/32 of the input image). These feature maps capture both high-level semantic information and low-level spatial details, which are essential for detecting pedestrians in complex scenes. To enhance the feature representation and reduce background-noise interference, a Convolutional Block Attention Module (CBAM) is embedded after the feature maps. CBAM adjusts the importance of each channel and spatial location in the feature maps through operations such as global average pooling, max pooling, and small fully connected networks along both the channel and spatial attention dimensions. This strengthens the feature information in key areas and suppresses background noise, enabling the network to focus on the target region. 
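The channel-then-spatial attention step described above can be sketched in a few lines of NumPy. This is an illustrative simplification, not the paper's implementation: `w1` and `w2` are hypothetical shared-MLP weights, and the 7×7 convolution CBAM uses for spatial attention is replaced here by a simple combination of the channel-wise mean and max maps.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cbam(feat, w1, w2):
    """Minimal CBAM-style attention on a (C, H, W) feature map.

    Channel attention: global average- and max-pooled descriptors pass
    through a shared two-layer MLP (w1, w2) and are summed. Spatial
    attention: channel-wise mean and max maps are combined (standing in
    for CBAM's 7x7 convolution, omitted here for brevity).
    """
    avg = feat.mean(axis=(1, 2))          # (C,) global average pooling
    mx = feat.max(axis=(1, 2))            # (C,) global max pooling
    ch_att = sigmoid(w2 @ np.maximum(w1 @ avg, 0) +
                     w2 @ np.maximum(w1 @ mx, 0))        # (C,) in (0, 1)
    feat = feat * ch_att[:, None, None]   # reweight channels
    sp = sigmoid(0.5 * (feat.mean(axis=0) + feat.max(axis=0)))  # (H, W)
    return feat * sp[None, :, :]          # reweight spatial locations
```

Because both attention maps lie in (0, 1), the module can only attenuate, never amplify, activations; the network relies on this to suppress background clutter around occluded targets.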
Subsequently, considering the limitations of CNN-based methods in extracting global information, a Mamba module is cascaded after the CBAM. The Mamba module first flattens the feature maps into one-dimensional image-patch vectors and then uses linear layers for feature extraction and transformation. It captures global context and long-range dependencies between feature vectors through forward and backward processing with a State Space Model (SSM). This helps extract the contextual information around occluded pedestrians and infer complete pedestrian features from the visible parts. Finally, a hierarchical feature fusion mechanism is designed. It first uses bilinear interpolation to bring the feature maps of different scales to a consistent spatial resolution. It then concatenates the three high-dimensional, low-resolution feature maps rich in semantic information along the channel dimension to enhance the deep semantic representation, and afterwards combines this preliminarily fused map with the low-dimensional, high-resolution feature map carrying more detailed location information, again along the channel dimension. This achieves a comprehensive fusion of high-level semantic and positional detail information, enabling the algorithm to capture multi-level semantics. The final feature map is processed by a detection head, which generates center heatmaps, scale heatmaps, and offset maps to predict pedestrian bounding boxes. Results and Discussions: To comprehensively verify the effectiveness of the improvements in GFFIE-Net, we designed ablation experiments covering four aspects. 
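The bidirectional SSM scan described above can be illustrated with a toy recurrence: flatten the spatial dimensions of a feature map into per-channel sequences, run a linear state-space recurrence forward and backward, and sum the two passes. The parameters `a`, `b`, `c` below are hypothetical fixed scalars; an actual Mamba block learns input-dependent, per-channel parameters (the "selective" part), which this sketch deliberately omits.

```python
import numpy as np

def ssm_scan(x, a, b, c):
    """Scalar state-space recurrence h_t = a*h_{t-1} + b*x_t, y_t = c*h_t,
    run over a 1-D sequence; a simplified stand-in for a selective scan."""
    h, ys = 0.0, []
    for xt in x:
        h = a * h + b * xt
        ys.append(c * h)
    return np.array(ys)

def bidirectional_ssm(feat, a=0.9, b=1.0, c=1.0):
    """Flatten a (C, H, W) feature map into per-channel 1-D sequences and
    sum a forward and a backward scan, so every position aggregates
    context from both sides of the flattened sequence."""
    c_dim, h_dim, w_dim = feat.shape
    seq = feat.reshape(c_dim, h_dim * w_dim)      # flatten spatial dims
    fwd = np.stack([ssm_scan(s, a, b, c) for s in seq])
    bwd = np.stack([ssm_scan(s[::-1], a, b, c)[::-1] for s in seq])
    return (fwd + bwd).reshape(c_dim, h_dim, w_dim)
```

The point of the two passes is that a purely causal scan would let an occluded position see only pixels before it in raster order; summing forward and backward scans gives every position a receptive field over the whole sequence at linear cost.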
First, we investigated the effect of different global-information-extraction methods on the results; second, we analyzed the contribution of each module to network performance; third, we explored the impact of different scales, the sequential cascade structure, and the design of the hierarchical feature fusion; and fourth, we verified the robustness of the proposed enhancement modules by testing them on different backbone networks. Extensive experiments were then conducted on three challenging pedestrian datasets: CityPersons, Caltech, and CrowdHuman. The results show that the MR⁻² metric reaches 43.7% on the heavily occluded subset of CityPersons, an improvement of 4.4% over the baseline method, and 33.6% on the heavily occluded subset of Caltech; on CrowdHuman, the MR⁻² metric reaches 43.2%, outperforming several mainstream methods. Finally, we conducted a visualization analysis of the detection boxes and center heatmaps, selecting seven representative scene images from the three datasets, covering traffic, intersection video surveillance, nighttime, dense traffic, strong light, small targets, and crowded pedestrians. Compared with the baseline network, GFFIE-Net produced stronger center responses and more accurate detection-box localization for occluded pedestrians. In the dense-traffic scene, for example, when multiple pedestrians occluded each other, the baseline network missed many pedestrians and its center heatmap responded only weakly to occluded ones, whereas GFFIE-Net accurately identified and located them. 
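For context, the MR⁻² metric reported above is the log-average miss rate used by the Caltech/CityPersons protocol: the geometric mean of the miss rate sampled at nine false-positives-per-image (FPPI) points spaced evenly in log space over [10⁻², 10⁰]; lower is better. A minimal sketch of that computation, assuming a precomputed miss-rate curve sorted by ascending FPPI:

```python
import numpy as np

def log_average_miss_rate(fppi, miss_rate):
    """Log-average miss rate (MR^-2) from a sorted FPPI/miss-rate curve:
    sample the miss rate at nine reference FPPI points in [1e-2, 1e0]
    and return their geometric mean."""
    refs = np.logspace(-2.0, 0.0, 9)
    samples = []
    for r in refs:
        # miss rate at the largest FPPI not exceeding the reference point
        idx = np.searchsorted(fppi, r, side="right") - 1
        samples.append(miss_rate[max(idx, 0)])
    # geometric mean, guarding against log(0) for perfect detectors
    return np.exp(np.mean(np.log(np.maximum(samples, 1e-10))))
```

The geometric mean is what makes a 4.4-point drop on the heavily occluded subset meaningful: improvements at low-FPPI operating points, where occluded pedestrians are hardest, weigh as much as those at high FPPI.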
This indicates that GFFIE-Net can effectively handle occluded pedestrians in various scenarios, demonstrating strong adaptability and high detection performance. Conclusions: By integrating the CBAM module, the Mamba module, and the hierarchical feature fusion mechanism, GFFIE-Net effectively addresses the challenges of feature loss and background noise in occluded scenarios. The experimental results on three benchmark datasets demonstrate the superiority of GFFIE-Net over existing methods, particularly in handling heavily occluded pedestrians. Other: Future research could explore semi-supervised or self-supervised learning with limited labeled data, which would reduce the dependence on large-scale labeled datasets, improve the model's generalization ability, and further enhance the method's applicability and accuracy across scenarios. |
| format | Article |
| id | doaj-art-843fcda072754aff8984fdf2e0eb3658 |
| institution | Kabale University |
| issn | 2096-3246 |
| language | English |
| publishDate | 2025-01-01 |
| publisher | Editorial Department of Journal of Sichuan University (Engineering Science Edition) |
| record_format | Article |
| series | 工程科学与技术 |
| title | Global Feature Focusing and Information Enhancement Network for Occluded Pedestrian Detection |
| topic | pedestrian detection mamba feature enhancement CBAM |
| url | http://jsuese.scu.edu.cn/thesisDetails#10.12454/j.jsuese.202401025 |