Text-Guided Diverse Scene Interaction Synthesis by Disentangling Actions From Scenes

Generating human motion within 3D scenes from textual descriptions remains a challenging task because of the scarcity of hybrid datasets encompassing text, 3D scenes, and motion. Existing approaches suffer from fundamental limitations: a lack of datasets that integrate text, 3D scenes, and motion, and a reliance on end-to-end methods, which constrain the diversity and realism of generated human-scene interactions. In this paper, we propose a novel method to generate motions of humans interacting with objects in a 3D scene given a textual prompt. Our key innovation focuses on decomposing the motion generation task into distinct steps: 1) generating key poses from textual and scene contexts and 2) synthesizing full motion trajectories guided by these key poses and path planning. This approach eliminates the need for hybrid datasets by leveraging independent text-motion and pose datasets, significantly expanding action diversity and overcoming the constraints of prior works. Unlike previous methods, which focus on limited action types or rely on scarce datasets, our approach enables scalable and adaptable motion generation. Through extensive experiments, we demonstrate that our framework achieves unparalleled diversity and contextually accurate motions, advancing the state-of-the-art in human-scene interaction synthesis.
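
The two-step decomposition described in the abstract (first propose key poses from the text and the scene, then synthesize a full trajectory with path planning) can be illustrated with a small runnable toy. The sketch below is our own illustration under stated assumptions, not the authors' implementation: the dictionary grid scene, the breadth-first-search planner, and the per-waypoint frame emission are simplifying stand-ins for the paper's learned pose model, scene-aware planner, and motion-infilling model.

from collections import deque

# Step 1 stand-in: in the paper, a pose model trained on an independent
# pose dataset proposes key poses from the prompt and scene context.
# Here we hard-code a single key pose anchored at the chair.
def generate_key_poses(prompt, scene):
    chair = scene["objects"]["chair"]  # assumed toy scene layout
    return [{"label": f"key pose for '{prompt}'", "root": chair}]

# Step 2a stand-in: breadth-first search over free grid cells as a toy
# path planner (any scene-aware planner could take its place). Assumes
# the goal is reachable from the start.
def plan_path(grid, start, goal):
    queue, came_from = deque([start]), {start: None}
    while queue:
        cell = queue.popleft()
        if cell == goal:
            break
        x, y = cell
        for nxt in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            if nxt not in came_from and grid.get(nxt) == ".":
                came_from[nxt] = cell
                queue.append(nxt)
    path, cell = [], goal
    while cell is not None:  # walk parents back to the start
        path.append(cell)
        cell = came_from[cell]
    return path[::-1]

# Step 2b stand-in: a learned motion model would infill full-body frames
# along the planned root trajectory; we emit one placeholder frame per
# waypoint and finish each segment on its key pose.
def synthesize_motion(start_root, key_poses, grid):
    frames, current = [], start_root
    for pose in key_poses:
        for waypoint in plan_path(grid, current, pose["root"]):
            frames.append({"root": waypoint, "pose": "walking"})
        frames[-1]["pose"] = pose["label"]
        current = pose["root"]
    return frames

if __name__ == "__main__":
    # 4x4 floor grid: "." is free space, "#" is an obstacle.
    grid = {(x, y): "." for x in range(4) for y in range(4)}
    grid[(1, 1)] = grid[(2, 1)] = "#"
    scene = {"objects": {"chair": (3, 3)}}
    key_poses = generate_key_poses("sit on the chair", scene)
    for frame in synthesize_motion((0, 0), key_poses, grid):
        print(frame)

Even in this toy, the abstract's design point holds: step 1 needs only pose data conditioned on text and scene, and step 2 needs only a planner plus a motion model, so no hybrid dataset covering text, 3D scenes, and motion at once is required.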

Bibliographic Details
Main Authors: Hitoshi Teshima, Naoki Wake, Diego Thomas, Yuta Nakashima, Hiroshi Kawasaki, Katsushi Ikeuchi
Format: Article
Language: English
Published: IEEE, 2025-01-01
Series: IEEE Access, Vol. 13, pp. 73818-73830
DOI: 10.1109/ACCESS.2025.3562086
ISSN: 2169-3536
Subjects: Text-to-motion generation; 3D scene understanding; human-object interaction; affordance-based interaction; motion diffusion models
Online Access: https://ieeexplore.ieee.org/document/10967356/
Author Details:
Hitoshi Teshima (ORCID: 0000-0002-6431-4514), Department of Information Science and Technology, Kyushu University, Fukuoka, Japan
Naoki Wake (ORCID: 0000-0001-8278-2373), Applied Robotics Research, Microsoft, Redmond, WA, USA
Diego Thomas (ORCID: 0000-0002-8525-7133), Department of Information Science and Technology, Kyushu University, Fukuoka, Japan
Yuta Nakashima (ORCID: 0000-0001-8000-3567), Institute for Datability Science, Osaka University, Osaka, Japan
Hiroshi Kawasaki (ORCID: 0000-0001-5825-6066), Department of Information Science and Technology, Kyushu University, Fukuoka, Japan
Katsushi Ikeuchi (ORCID: 0000-0001-9758-9357), Applied Robotics Research, Microsoft, Redmond, WA, USA