UniMotion-DM: Uniform Text-Motion Generation and Editing via Diffusion Model

Diffusion models have demonstrated substantial success in controllable generation for continuous modalities, positioning them as highly suitable for tasks such as human motion generation. However, existing approaches are typically limited to single-task applications, such as text-to-motion generation, and often lack versatility and editing capabilities. To overcome these limitations, we propose UniMotion-DM, a unified framework for both text-motion generation and editing based on diffusion models. UniMotion-DM integrates three core components: 1) a Contrastive Text-Motion Variational Autoencoder (CTMV), which aligns text and motion in a shared latent space using contrastive learning; 2) a controllable diffusion model tailored to the CTMV representation for generating and editing multimodal content; and 3) a Multimodal Conditional Representation and Editing (MCRE) module that leverages CLIP embeddings to enable precise and flexible control across various tasks. The ability of UniMotion-DM to seamlessly handle text-to-motion generation, motion captioning, motion completion, and multimodal editing results in significant improvements in both quantitative and qualitative evaluations. Beyond conventional domains such as gaming and virtual reality, we emphasize UniMotion-DM’s potential in underexplored fields such as healthcare and creative industries. For example, UniMotion-DM could be used to generate personalized physical therapy routines or assist designers in rapidly prototyping motion-based narratives. By addressing these emerging applications, UniMotion-DM paves the way for utilizing multimodal generative models in interdisciplinary and socially impactful areas.
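The record contains no implementation details, but the contrastive text-motion alignment that the abstract attributes to CTMV can be illustrated with a minimal sketch. Everything below (encoder outputs, latent dimension, temperature, function name) is an illustrative assumption, not the paper's actual design.

    # Minimal sketch of contrastive alignment of text and motion in a shared
    # latent space, as described at a high level in the abstract. Shapes, the
    # temperature value, and where the latents come from (e.g. VAE posterior
    # means) are assumptions for illustration only.
    import torch
    import torch.nn.functional as F

    def contrastive_alignment_loss(text_latents, motion_latents, temperature=0.07):
        """Symmetric InfoNCE loss over a batch of paired text/motion latents.

        text_latents, motion_latents: (batch, dim) tensors produced by
        hypothetical text and motion encoders.
        """
        text = F.normalize(text_latents, dim=-1)
        motion = F.normalize(motion_latents, dim=-1)
        logits = text @ motion.t() / temperature  # (batch, batch) similarities
        targets = torch.arange(text.size(0), device=text.device)
        # Matched text-motion pairs lie on the diagonal: pull them together,
        # push mismatched pairs apart, in both retrieval directions.
        loss_t2m = F.cross_entropy(logits, targets)
        loss_m2t = F.cross_entropy(logits.t(), targets)
        return 0.5 * (loss_t2m + loss_m2t)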

Bibliographic Details
Main Authors: Song Lin, Wenjun Hou
Format: Article
Language: English
Published: IEEE 2024-01-01
Series:IEEE Access
Subjects: Diffusion-based multimodal generation; multiple tasks; contrastive learning
Online Access: https://ieeexplore.ieee.org/document/10802885/
Full Citation: Song Lin (ORCID: 0009-0008-3355-599X), School of Intelligent Engineering and Automation, Beijing University of Posts and Telecommunications, Beijing, China, and Wenjun Hou, Beijing Key Laboratory of Network Systems and Network Culture, Beijing University of Posts and Telecommunications, Beijing, China, "UniMotion-DM: Uniform Text-Motion Generation and Editing via Diffusion Model," IEEE Access (ISSN 2169-3536), vol. 12, pp. 196984-196999, 2024, doi: 10.1109/ACCESS.2024.3518300.