Human Body Segmentation in Wide-Angle Images Based on Fast Vision Transformers
Achieving effective and efficient segmentation of human body regions in distorted images is of practical significance. Current methods rely on transformers to extract discriminative features. However, due to the unique global attention mechanism, existing transformers lack detailed image features and incur high computational costs, resulting in subpar segmentation accuracy and slow inference speed.
| Main Authors: | Xiao Yu, Yunfeng Hua, Siyun Zhang, Zhaocheng Xu |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | IEEE, 2024-01-01 |
| Series: | IEEE Access |
| Subjects: | Human semantic segmentation; wide-angle distorted images; human spatial prior; token pruning; ViT |
| Online Access: | https://ieeexplore.ieee.org/document/10769079/ |
| _version_ | 1850263604520026112 |
|---|---|
| author | Xiao Yu; Yunfeng Hua; Siyun Zhang; Zhaocheng Xu |
| author_facet | Xiao Yu; Yunfeng Hua; Siyun Zhang; Zhaocheng Xu |
| author_sort | Xiao Yu |
| collection | DOAJ |
| description | Achieving effective and efficient segmentation of human body regions in distorted images is of practical significance. Current methods rely on transformers to extract discriminative features. However, due to the unique global attention mechanism, existing transformers lack detailed image features and incur high computational costs, resulting in subpar segmentation accuracy and slow inference speed. In this paper, we introduce the Human Spatial Prior Module (HSPM) and the Dynamic Token Pruning Module (DTPM). The HSPM is specifically designed to capture human features in distorted images, using dynamic methods to extract highly variable details. The DTPM accelerates inference by pruning unimportant tokens from each layer of the Vision Transformer (ViT). Unlike approaches that discard pruned tokens outright, the pruned tokens are preserved as feature maps and selectively reactivated in subsequent network layers to improve model performance. To validate the effectiveness of the Vision Transformer in Distorted Image (ViT-DI), we extend the ADE20K dataset and conduct experiments on the constructed dataset and the Cityscapes dataset. Our method achieves an mIoU gain of 1.6 points and an FPS gain of 4.4 on the ADE20K dataset, and an mIoU gain of 0.77 points and an FPS gain of 2.9 on the Cityscapes dataset, with a reduction in computational cost of approximately 130 GFLOPs. The URL to our dataset is: <uri>https://github.com/GitHubYuxiao/ViT-DI</uri>. |
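The abstract gives no implementation details for the DTPM, but the prune-then-cache idea it describes (drop low-importance tokens at a layer while preserving their features so later layers can reactivate them) can be illustrated with a minimal NumPy sketch. All function names and the importance scores below are illustrative assumptions, not the authors' code:

```python
import numpy as np

def prune_tokens(tokens, scores, keep_ratio=0.5):
    """Keep the top-scoring tokens at one ViT layer; cache the rest.

    tokens: (n, d) array of token features; scores: (n,) importance
    scores (e.g. derived from attention). Returns the kept tokens,
    their original indices, and a cache of the pruned tokens so a
    later layer can reactivate them instead of losing them.
    """
    n = len(tokens)
    k = max(1, int(round(n * keep_ratio)))
    order = np.argsort(scores)[::-1]        # indices by descending importance
    kept_idx = np.sort(order[:k])           # preserve original token order
    pruned_idx = np.sort(order[k:])
    cache = {"idx": pruned_idx, "feat": tokens[pruned_idx].copy()}
    return tokens[kept_idx], kept_idx, cache

def reactivate_tokens(kept, kept_idx, cache, n_total):
    """Re-insert cached tokens at their original positions."""
    out = np.empty((n_total, kept.shape[1]), dtype=kept.dtype)
    out[kept_idx] = kept
    out[cache["idx"]] = cache["feat"]
    return out

# Example: 6 tokens of dimension 4, keep the 3 most important.
tokens = np.arange(24, dtype=float).reshape(6, 4)
scores = np.array([0.1, 0.9, 0.3, 0.8, 0.2, 0.7])
kept, kept_idx, cache = prune_tokens(tokens, scores, keep_ratio=0.5)
restored = reactivate_tokens(kept, kept_idx, cache, n_total=6)
```

In a real ViT, only the kept tokens would pass through the attention and MLP blocks of intermediate layers (this is where the FPS gain comes from), while the cache plays the role of the feature maps the abstract mentions for selective reactivation.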
| format | Article |
| id | doaj-art-305d40a2abe64f4f988255308a46b34e |
| institution | OA Journals |
| issn | 2169-3536 |
| language | English |
| publishDate | 2024-01-01 |
| publisher | IEEE |
| record_format | Article |
| series | IEEE Access |
| spelling | doaj-art-305d40a2abe64f4f988255308a46b34e (2025-08-20T01:54:55Z); eng; IEEE; IEEE Access; ISSN 2169-3536; 2024-01-01; vol. 12, pp. 178971-178981; DOI 10.1109/ACCESS.2024.3507272; IEEE document 10769079; Human Body Segmentation in Wide-Angle Images Based on Fast Vision Transformers; Xiao Yu (https://orcid.org/0009-0006-1875-9789), School of Computer Science and Technology, Zhejiang Gongshang University, Hangzhou, China; Yunfeng Hua, Shining3D Tech Company Ltd., Hangzhou, China; Siyun Zhang (https://orcid.org/0009-0009-7353-8235), School of Computer Science and Technology, Zhejiang Gongshang University, Hangzhou, China; Zhaocheng Xu, School of Mathematical and Computational Sciences, Massey University, Auckland, New Zealand; (abstract as in the description field above); https://ieeexplore.ieee.org/document/10769079/; keywords: Human semantic segmentation; wide-angle distorted images; human spatial prior; token pruning; ViT |
| spellingShingle | Xiao Yu; Yunfeng Hua; Siyun Zhang; Zhaocheng Xu; Human Body Segmentation in Wide-Angle Images Based on Fast Vision Transformers; IEEE Access; Human semantic segmentation; wide-angle distorted images; human spatial prior; token pruning; ViT |
| title | Human Body Segmentation in Wide-Angle Images Based on Fast Vision Transformers |
| title_full | Human Body Segmentation in Wide-Angle Images Based on Fast Vision Transformers |
| title_fullStr | Human Body Segmentation in Wide-Angle Images Based on Fast Vision Transformers |
| title_full_unstemmed | Human Body Segmentation in Wide-Angle Images Based on Fast Vision Transformers |
| title_short | Human Body Segmentation in Wide-Angle Images Based on Fast Vision Transformers |
| title_sort | human body segmentation in wide angle images based on fast vision transformers |
| topic | Human semantic segmentation; wide-angle distorted images; human spatial prior; token pruning; ViT |
| url | https://ieeexplore.ieee.org/document/10769079/ |
| work_keys_str_mv | AT xiaoyu humanbodysegmentationinwideangleimagesbasedonfastvisiontransformers AT yunfenghua humanbodysegmentationinwideangleimagesbasedonfastvisiontransformers AT siyunzhang humanbodysegmentationinwideangleimagesbasedonfastvisiontransformers AT zhaochengxu humanbodysegmentationinwideangleimagesbasedonfastvisiontransformers |