DualPose: Dual-Block Transformer Decoder with Contrastive Denoising for Multi-Person Pose Estimation

Multi-person pose estimation is the task of detecting and regressing the keypoint coordinates of multiple people in a single image. Significant progress has been achieved in recent years, especially with the introduction of transformer-based end-to-end methods. In this paper, we present DualPose, a...

Bibliographic Details
Main Authors: Matteo Fincato, Roberto Vezzani
Format: Article
Language: English
Published: MDPI AG 2025-05-01
Series: Sensors
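Subjects: contrastive denoising; DualPose; human pose estimation; multi-person pose estimation; transformer-based models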
Online Access: https://www.mdpi.com/1424-8220/25/10/2997
author Matteo Fincato
Roberto Vezzani
collection DOAJ
description Multi-person pose estimation is the task of detecting and regressing the keypoint coordinates of multiple people in a single image. Significant progress has been achieved in recent years, especially with the introduction of transformer-based end-to-end methods. In this paper, we present DualPose, a novel framework that enhances multi-person pose estimation by leveraging a dual-block transformer decoding architecture. Class prediction and keypoint estimation are split into parallel blocks so that each sub-task can be improved separately and the risk of interference is reduced. This architecture improves the precision of keypoint localization and the model’s capacity to accurately classify individuals. To improve model performance, the Keypoint-Block uses parallel processing of self-attentions, providing a novel strategy that improves keypoint localization accuracy and precision. Additionally, DualPose incorporates a contrastive denoising (CDN) mechanism, leveraging positive and negative samples to stabilize training and improve robustness. With CDN, a variety of training samples are created by introducing controlled noise into the ground truth, improving the model’s ability to discern between valid and incorrect keypoints. DualPose achieves state-of-the-art results, outperforming recent end-to-end methods, as shown by extensive experiments on the MS COCO and CrowdPose datasets. The code and pretrained models are publicly available.
format Article
id doaj-art-b85c7af2d22f4fe3ac49daeab403932f
institution OA Journals
issn 1424-8220
language English
publishDate 2025-05-01
publisher MDPI AG
record_format Article
series Sensors
spelling doaj-art-b85c7af2d22f4fe3ac49daeab403932f | 2025-08-20T01:56:42Z | eng | MDPI AG | Sensors | ISSN 1424-8220 | 2025-05-01 | vol. 25, no. 10, art. 2997 | DOI 10.3390/s25102997 | Matteo Fincato, Roberto Vezzani (both: Department of Engineering “Enzo Ferrari”, University of Modena and Reggio Emilia, Via P. Vivarelli 10, 41125 Modena, Italy)
title DualPose: Dual-Block Transformer Decoder with Contrastive Denoising for Multi-Person Pose Estimation
topic contrastive denoising
DualPose
human pose estimation
multi-person pose estimation
transformer-based models
url https://www.mdpi.com/1424-8220/25/10/2997