Text-Guided Diverse Scene Interaction Synthesis by Disentangling Actions From Scenes
Generating human motion within 3D scenes from textual descriptions remains a challenging task because of the scarcity of hybrid datasets encompassing text, 3D scenes, and motion. Existing approaches suffer from fundamental limitations: a lack of datasets that integrate text, 3D scenes, and motion, and a reliance on end-to-end methods, which constrain the diversity and realism of generated human-scene interactions. In this paper, we propose a novel method to generate motions of humans interacting with objects in a 3D scene given a textual prompt. Our key innovation focuses on decomposing the motion generation task into distinct steps: 1) generating key poses from textual and scene contexts and 2) synthesizing full motion trajectories guided by these key poses and path planning. This approach eliminates the need for hybrid datasets by leveraging independent text-motion and pose datasets, significantly expanding action diversity and overcoming the constraints of prior works. Unlike previous methods, which focus on limited action types or rely on scarce datasets, our approach enables scalable and adaptable motion generation. Through extensive experiments, we demonstrate that our framework achieves unparalleled diversity and contextually accurate motions, advancing the state-of-the-art in human-scene interaction synthesis.
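To make the two-step decomposition described in the abstract concrete, here is a minimal, hypothetical sketch of such a pipeline. None of the class or function names (`KeyPose`, `generate_key_poses`, `plan_path`, `synthesize_motion`) come from the paper, and the stubs only chain 2D waypoints rather than producing full-body poses with a text-conditioned generator as the authors describe.

```python
# Illustrative sketch (not from the paper): the two-step decomposition the
# abstract describes -- (1) key-pose generation from text + scene context,
# (2) full-trajectory synthesis guided by those key poses and path planning.
# All names below are hypothetical placeholders.
from dataclasses import dataclass
from typing import Dict, List, Tuple

@dataclass
class KeyPose:
    position: Tuple[float, float]   # root position on the floor plane (x, y)
    label: str                      # action label derived from the text prompt

def generate_key_poses(prompt: str,
                       object_positions: Dict[str, Tuple[float, float]]) -> List[KeyPose]:
    """Step 1 (stub): map the prompt to target objects and key poses.

    A real system would use a text-conditioned pose generator and scene
    affordances; here we just place one key pose at each mentioned object.
    """
    poses = []
    for name, pos in object_positions.items():
        if name in prompt.lower():
            poses.append(KeyPose(position=pos, label=f"interact with {name}"))
    return poses

def plan_path(start: Tuple[float, float], goal: Tuple[float, float],
              steps: int = 10) -> List[Tuple[float, float]]:
    """Step 2a (stub): straight-line waypoints standing in for a path planner."""
    return [
        (start[0] + (goal[0] - start[0]) * t / steps,
         start[1] + (goal[1] - start[1]) * t / steps)
        for t in range(steps + 1)
    ]

def synthesize_motion(start: Tuple[float, float],
                      key_poses: List[KeyPose]) -> List[Tuple[float, float]]:
    """Step 2b (stub): chain planned paths so the motion passes through every
    key pose; a real system would fill in full-body frames along the way."""
    trajectory, current = [], start
    for kp in key_poses:
        trajectory.extend(plan_path(current, kp.position))
        current = kp.position
    return trajectory

if __name__ == "__main__":
    scene = {"chair": (3.0, 1.0), "table": (5.0, 4.0)}
    key_poses = generate_key_poses("walk to the chair then the table", scene)
    motion = synthesize_motion(start=(0.0, 0.0), key_poses=key_poses)
    print(f"{len(key_poses)} key poses, {len(motion)} trajectory points")
```

The point of the decomposition, per the abstract, is that the key-pose step and the trajectory step can be learned from independent text-motion and pose datasets, so no combined text-scene-motion dataset is required.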
| Main Authors: | Hitoshi Teshima, Naoki Wake, Diego Thomas, Yuta Nakashima, Hiroshi Kawasaki, Katsushi Ikeuchi |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | IEEE, 2025-01-01 |
| Series: | IEEE Access |
| Subjects: | Text-to-motion generation; 3D scene understanding; human-object interaction; affordance-based interaction; motion diffusion models |
| Online Access: | https://ieeexplore.ieee.org/document/10967356/ |
| author | Hitoshi Teshima, Naoki Wake, Diego Thomas, Yuta Nakashima, Hiroshi Kawasaki, Katsushi Ikeuchi |
|---|---|
| collection | DOAJ |
| description | Generating human motion within 3D scenes from textual descriptions remains a challenging task because of the scarcity of hybrid datasets encompassing text, 3D scenes, and motion. Existing approaches suffer from fundamental limitations: a lack of datasets that integrate text, 3D scenes, and motion, and a reliance on end-to-end methods, which constrain the diversity and realism of generated human-scene interactions. In this paper, we propose a novel method to generate motions of humans interacting with objects in a 3D scene given a textual prompt. Our key innovation focuses on decomposing the motion generation task into distinct steps: 1) generating key poses from textual and scene contexts and 2) synthesizing full motion trajectories guided by these key poses and path planning. This approach eliminates the need for hybrid datasets by leveraging independent text-motion and pose datasets, significantly expanding action diversity and overcoming the constraints of prior works. Unlike previous methods, which focus on limited action types or rely on scarce datasets, our approach enables scalable and adaptable motion generation. Through extensive experiments, we demonstrate that our framework achieves unparalleled diversity and contextually accurate motions, advancing the state-of-the-art in human-scene interaction synthesis. |
| format | Article |
| id | doaj-art-4d30aa1d37d64f2cbaaa90fd912cd89e |
| institution | DOAJ |
| issn | 2169-3536 |
| language | English |
| publishDate | 2025-01-01 |
| publisher | IEEE |
| record_format | Article |
| series | IEEE Access |
| spelling | doaj-art-4d30aa1d37d64f2cbaaa90fd912cd89e; updated 2025-08-20T03:11:25Z; English; IEEE; IEEE Access; ISSN 2169-3536; published 2025-01-01; vol. 13, pp. 73818-73830; DOI 10.1109/ACCESS.2025.3562086; article 10967356; "Text-Guided Diverse Scene Interaction Synthesis by Disentangling Actions From Scenes"; Hitoshi Teshima (https://orcid.org/0000-0002-6431-4514; Department of Information Science and Technology, Kyushu University, Fukuoka, Japan), Naoki Wake (https://orcid.org/0000-0001-8278-2373; Applied Robotics Research, Microsoft, Redmond, WA, USA), Diego Thomas (https://orcid.org/0000-0002-8525-7133; Department of Information Science and Technology, Kyushu University, Fukuoka, Japan), Yuta Nakashima (https://orcid.org/0000-0001-8000-3567; Institute for Datability Science, Osaka University, Osaka, Japan), Hiroshi Kawasaki (https://orcid.org/0000-0001-5825-6066; Department of Information Science and Technology, Kyushu University, Fukuoka, Japan), Katsushi Ikeuchi (https://orcid.org/0000-0001-9758-9357; Applied Robotics Research, Microsoft, Redmond, WA, USA); abstract as in the description field; https://ieeexplore.ieee.org/document/10967356/; Text-to-motion generation; 3D scene understanding; human-object interaction; affordance-based interaction; motion diffusion models |
| title | Text-Guided Diverse Scene Interaction Synthesis by Disentangling Actions From Scenes |
| topic | Text-to-motion generation; 3D scene understanding; human-object interaction; affordance-based interaction; motion diffusion models |
| url | https://ieeexplore.ieee.org/document/10967356/ |