Human Body Segmentation in Wide-Angle Images Based on Fast Vision Transformers

Achieving effective and efficient segmentation of human body regions in distorted images is of practical significance. Current methods rely on transformers to extract discriminative features. However, due to the unique global attention mechanism, existing transformers lack detailed image features an...

Full description

Saved in:
Bibliographic Details
Main Authors: Xiao Yu, Yunfeng Hua, Siyun Zhang, Zhaocheng Xu
Format: Article
Language: English
Published: IEEE 2024-01-01
Series: IEEE Access
Subjects:
Online Access: https://ieeexplore.ieee.org/document/10769079/
_version_ 1850263604520026112
author Xiao Yu
Yunfeng Hua
Siyun Zhang
Zhaocheng Xu
author_facet Xiao Yu
Yunfeng Hua
Siyun Zhang
Zhaocheng Xu
author_sort Xiao Yu
collection DOAJ
description Achieving effective and efficient segmentation of human body regions in distorted images is of practical significance. Current methods rely on transformers to extract discriminative features. However, due to the unique global attention mechanism, existing transformers lack detailed image features and incur high computational costs, resulting in subpar segmentation accuracy and slow inference speed. In this paper, we introduce the Human Spatial Prior Module (HSPM) and Dynamic Token Pruning Module (DTPM). The HSPM is specifically designed to capture human features in distorted images, using dynamic methods to extract highly variable details. The DTPM accelerates inference by pruning unimportant tokens from each layer of the Vision Transformer (ViT). Unlike traditional cropping approaches, the cropped tokens are preserved using feature maps and selectively reactivated in subsequent network layers to improve model performance. To validate the effectiveness of Vision Transformer in Distorted Image (ViT-DI), we extend the ADE20K dataset and conduct experiments on the constructed dataset and the Cityscapes dataset. Our method achieves an mIoU increase of 1.6 and an FPS increase of 4.4 on the ADE20K dataset, and an mIoU increase of 0.77 and an FPS increase of 2.9 on the Cityscapes dataset, with a reduction in model size of approximately 130 GFLOPs. The URL to our dataset is: <uri>https://github.com/GitHubYuxiao/ViT-DI</uri>.
format Article
id doaj-art-305d40a2abe64f4f988255308a46b34e
institution OA Journals
issn 2169-3536
language English
publishDate 2024-01-01
publisher IEEE
record_format Article
series IEEE Access
spelling doaj-art-305d40a2abe64f4f988255308a46b34e
Updated: 2025-08-20T01:54:55Z
Language: eng
Publisher: IEEE
Series: IEEE Access
ISSN: 2169-3536
Published: 2024-01-01
Volume 12, pp. 178971-178981
DOI: 10.1109/ACCESS.2024.3507272
IEEE Document: 10769079
Human Body Segmentation in Wide-Angle Images Based on Fast Vision Transformers
Xiao Yu (https://orcid.org/0009-0006-1875-9789), School of Computer Science and Technology, Zhejiang Gongshang University, Hangzhou, China
Yunfeng Hua, Shining3D Tech Company Ltd., Hangzhou, China
Siyun Zhang (https://orcid.org/0009-0009-7353-8235), School of Computer Science and Technology, Zhejiang Gongshang University, Hangzhou, China
Zhaocheng Xu, School of Mathematical and Computational Sciences, Massey University, Auckland, New Zealand
Abstract: as given in the description field above.
Online Access: https://ieeexplore.ieee.org/document/10769079/
Keywords: Human semantic segmentation; wide-angle distorted images; human spatial prior; token pruning; ViT
spellingShingle Xiao Yu
Yunfeng Hua
Siyun Zhang
Zhaocheng Xu
Human Body Segmentation in Wide-Angle Images Based on Fast Vision Transformers
IEEE Access
Human semantic segmentation
wide-angle distorted images
human spatial prior
token pruning
ViT
title Human Body Segmentation in Wide-Angle Images Based on Fast Vision Transformers
title_full Human Body Segmentation in Wide-Angle Images Based on Fast Vision Transformers
title_fullStr Human Body Segmentation in Wide-Angle Images Based on Fast Vision Transformers
title_full_unstemmed Human Body Segmentation in Wide-Angle Images Based on Fast Vision Transformers
title_short Human Body Segmentation in Wide-Angle Images Based on Fast Vision Transformers
title_sort human body segmentation in wide angle images based on fast vision transformers
topic Human semantic segmentation
wide-angle distorted images
human spatial prior
token pruning
ViT
url https://ieeexplore.ieee.org/document/10769079/
work_keys_str_mv AT xiaoyu humanbodysegmentationinwideangleimagesbasedonfastvisiontransformers
AT yunfenghua humanbodysegmentationinwideangleimagesbasedonfastvisiontransformers
AT siyunzhang humanbodysegmentationinwideangleimagesbasedonfastvisiontransformers
AT zhaochengxu humanbodysegmentationinwideangleimagesbasedonfastvisiontransformers
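The abstract describes dynamic token pruning in which low-importance tokens are dropped at each ViT layer but, unlike hard cropping, are cached in feature maps and selectively reactivated later. The following is a minimal NumPy sketch of that general idea, not the authors' implementation; the function names (`prune_tokens`, `reactivate`), the keep ratio, and the scoring scheme are all illustrative assumptions.

```python
import numpy as np

def prune_tokens(tokens, scores, keep_ratio=0.7):
    """Keep the top-scoring tokens; also return the pruned tokens and
    both index sets so the pruned ones can be restored later."""
    n = tokens.shape[0]
    k = max(1, int(n * keep_ratio))
    order = np.argsort(scores)[::-1]   # indices by descending importance
    keep_idx = np.sort(order[:k])      # preserve original token order
    drop_idx = np.sort(order[k:])
    return tokens[keep_idx], keep_idx, tokens[drop_idx], drop_idx

def reactivate(kept, keep_idx, cached, drop_idx, n):
    """Scatter kept and cached (previously pruned) tokens back into a
    full-length sequence, approximating selective reactivation."""
    full = np.zeros((n, kept.shape[1]), dtype=kept.dtype)
    full[keep_idx] = kept
    full[drop_idx] = cached
    return full

# Toy example: 8 tokens with 4-dim features.
rng = np.random.default_rng(0)
toks = rng.standard_normal((8, 4))
scores = rng.random(8)

kept, ki, cached, di = prune_tokens(toks, scores, keep_ratio=0.5)
restored = reactivate(kept, ki, cached, di, toks.shape[0])
assert np.allclose(restored, toks)  # nothing is lost, only deferred
```

The point of caching rather than discarding is visible in the final assertion: attention cost in intermediate layers is paid only for the kept tokens, yet the full token set can be reconstructed for later layers or the segmentation head.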