Text-Guided Diverse Scene Interaction Synthesis by Disentangling Actions From Scenes

Generating human motion within 3D scenes from textual descriptions remains a challenging task because of the scarcity of hybrid datasets encompassing text, 3D scenes, and motion. Existing approaches suffer from fundamental limitations: a lack of datasets that integrate text, 3D scenes, and motion, and a reliance on end-to-end methods, which constrain the diversity and realism of generated human-scene interactions. In this paper, we propose a novel method to generate motions of humans interacting with objects in a 3D scene given a textual prompt. Our key innovation focuses on decomposing the motion generation task into distinct steps: 1) generating key poses from textual and scene contexts and 2) synthesizing full motion trajectories guided by these key poses and path planning. This approach eliminates the need for hybrid datasets by leveraging independent text-motion and pose datasets, significantly expanding action diversity and overcoming the constraints of prior works. Unlike previous methods, which focus on limited action types or rely on scarce datasets, our approach enables scalable and adaptable motion generation. Through extensive experiments, we demonstrate that our framework achieves unparalleled diversity and contextually accurate motions, advancing the state-of-the-art in human-scene interaction synthesis.
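
The two-step decomposition described in the abstract (first propose key poses from the text and the scene, then synthesize a full trajectory with path planning) can be illustrated with a small runnable toy. The sketch below is our own illustration under stated assumptions, not the authors' implementation: the dictionary grid scene, the breadth-first-search planner, and the per-waypoint frame emission are simplifying stand-ins for the paper's learned pose model, scene-aware planner, and motion-infilling model.

from collections import deque

# Step 1 stand-in: in the paper, a pose model trained on an independent
# pose dataset proposes key poses from the prompt and scene context.
# Here we hard-code a single key pose anchored at the chair.
def generate_key_poses(prompt, scene):
    chair = scene["objects"]["chair"]  # assumed toy scene layout
    return [{"label": f"key pose for '{prompt}'", "root": chair}]

# Step 2a stand-in: breadth-first search over free grid cells as a toy
# path planner (any scene-aware planner could take its place). Assumes
# the goal is reachable from the start.
def plan_path(grid, start, goal):
    queue, came_from = deque([start]), {start: None}
    while queue:
        cell = queue.popleft()
        if cell == goal:
            break
        x, y = cell
        for nxt in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            if nxt not in came_from and grid.get(nxt) == ".":
                came_from[nxt] = cell
                queue.append(nxt)
    path, cell = [], goal
    while cell is not None:  # walk parents back to the start
        path.append(cell)
        cell = came_from[cell]
    return path[::-1]

# Step 2b stand-in: a learned motion model would infill full-body frames
# along the planned root trajectory; we emit one placeholder frame per
# waypoint and finish each segment on its key pose.
def synthesize_motion(start_root, key_poses, grid):
    frames, current = [], start_root
    for pose in key_poses:
        for waypoint in plan_path(grid, current, pose["root"]):
            frames.append({"root": waypoint, "pose": "walking"})
        frames[-1]["pose"] = pose["label"]
        current = pose["root"]
    return frames

if __name__ == "__main__":
    # 4x4 floor grid: "." is free space, "#" is an obstacle.
    grid = {(x, y): "." for x in range(4) for y in range(4)}
    grid[(1, 1)] = grid[(2, 1)] = "#"
    scene = {"objects": {"chair": (3, 3)}}
    key_poses = generate_key_poses("sit on the chair", scene)
    for frame in synthesize_motion((0, 0), key_poses, grid):
        print(frame)

Even in this toy, the abstract's design point holds: step 1 needs only pose data conditioned on text and scene, and step 2 needs only a planner plus a motion model, so no hybrid dataset covering text, 3D scenes, and motion at once is required.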

Bibliographic Details
Main Authors: Hitoshi Teshima, Naoki Wake, Diego Thomas, Yuta Nakashima, Hiroshi Kawasaki, Katsushi Ikeuchi
Format: Article
Language: English
Published: IEEE, 2025-01-01
Series: IEEE Access, Vol. 13, pp. 73818-73830
DOI: 10.1109/ACCESS.2025.3562086
ISSN: 2169-3536
Subjects: Text-to-motion generation; 3D scene understanding; human-object interaction; affordance-based interaction; motion diffusion models
Online Access: https://ieeexplore.ieee.org/document/10967356/
Author Details:
Hitoshi Teshima (ORCID: 0000-0002-6431-4514), Department of Information Science and Technology, Kyushu University, Fukuoka, Japan
Naoki Wake (ORCID: 0000-0001-8278-2373), Applied Robotics Research, Microsoft, Redmond, WA, USA
Diego Thomas (ORCID: 0000-0002-8525-7133), Department of Information Science and Technology, Kyushu University, Fukuoka, Japan
Yuta Nakashima (ORCID: 0000-0001-8000-3567), Institute for Datability Science, Osaka University, Osaka, Japan
Hiroshi Kawasaki (ORCID: 0000-0001-5825-6066), Department of Information Science and Technology, Kyushu University, Fukuoka, Japan
Katsushi Ikeuchi (ORCID: 0000-0001-9758-9357), Applied Robotics Research, Microsoft, Redmond, WA, USA