Journal of System Simulation

Abstract

To improve the accuracy, controllability, and realism of text-driven human motion generation, a novel method is proposed that integrates fine-grained textual semantics with spatial control signals. Within the diffusion model framework, both global text tokens and body-part-level local tokens are introduced. These are encoded using CLIP to obtain corresponding features, which are then fed into the motion diffusion model to enable fine-grained control over different body parts. Spatial guidance is used to dynamically adjust joint positions during the diffusion denoising process, ensuring that the generated motion adheres to spatial constraints. Realism guidance is incorporated to enhance the naturalness and overall coordination of uncontrolled joints. Experiments conducted on the HumanML3D dataset involved fine-grained rewriting of 44 970 text samples using ChatGPT-4o to improve semantic alignment between text and motion. Results demonstrate that the proposed method outperforms existing approaches in motion semantic consistency, spatial control accuracy, and generation quality, and that it can produce human motions that meet user expectations in both semantic alignment and motion quality.
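The abstract describes a pipeline in which CLIP-encoded global and part-level text features condition a motion diffusion model, while spatial guidance steers controlled joints during denoising. Below is a minimal, self-contained sketch of that idea under stated assumptions: the denoiser architecture, the blending-style guidance rule, the joint indexing, and all names (MotionDenoiser, spatial_guidance) are hypothetical illustrations rather than the paper's implementation, and the random text_feat stands in for real CLIP features of the global and body-part tokens.

```python
# Illustrative sketch only; not the authors' code. All modules and the
# guidance rule are simplified stand-ins for the method the abstract describes.
import torch
import torch.nn as nn

class MotionDenoiser(nn.Module):
    """Toy stand-in for a motion diffusion model conditioned on text features."""
    def __init__(self, motion_dim: int, text_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(motion_dim + text_dim + 1, 256),
            nn.SiLU(),
            nn.Linear(256, motion_dim),
        )

    def forward(self, x_t, t, text_feat):
        # Condition on the noisy motion, a scalar timestep embedding, and text.
        t_emb = t.float().unsqueeze(-1) / 1000.0
        return self.net(torch.cat([x_t, text_feat, t_emb], dim=-1))  # predicts x0

def spatial_guidance(x0_pred, target_pos, joint_idx, strength=0.5):
    """Nudge the controlled joint dimensions toward user-given spatial targets
    (a simple blend; the paper's actual guidance rule is not given here)."""
    x0 = x0_pred.clone()
    x0[..., joint_idx] = (1 - strength) * x0[..., joint_idx] + strength * target_pos
    return x0

# Deterministic DDIM-style denoising with per-step spatial guidance.
T, motion_dim, text_dim = 1000, 66, 512
model = MotionDenoiser(motion_dim, text_dim)
betas = torch.linspace(1e-4, 0.02, T)
alpha_bars = torch.cumprod(1.0 - betas, dim=0)

text_feat = torch.randn(1, text_dim)        # placeholder for CLIP token features
target_pos = torch.tensor([0.3, 1.2, 0.0])  # example target position for one joint
joint_idx = torch.tensor([0, 1, 2])         # dimensions of the controlled joint

x_t = torch.randn(1, motion_dim)
for t in reversed(range(T)):
    t_b = torch.full((1,), t)
    x0_pred = model(x_t, t_b, text_feat)
    x0_pred = spatial_guidance(x0_pred, target_pos, joint_idx)  # enforce constraint
    ab = alpha_bars[t]
    ab_prev = alpha_bars[t - 1] if t > 0 else torch.tensor(1.0)
    # Recover the implied noise, then take the deterministic DDIM step
    # using the guided x0 prediction.
    eps = (x_t - ab.sqrt() * x0_pred) / (1 - ab).sqrt()
    x_t = ab_prev.sqrt() * x0_pred + (1 - ab_prev).sqrt() * eps
```

Applying the guidance to the predicted clean motion x0 rather than to the noisy sample keeps the spatial constraint consistent across noise levels; the realism guidance for uncontrolled joints mentioned in the abstract would slot into the same loop but is omitted here.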

First Page

136

Last Page

157

CLC

TP391.9

Recommended Citation

Jiang Binze, Song Wenfeng, Hou Xia, et al. Diffusion Model for Human Motion Generation with Fine-grained Text and Spatial Control Signals[J]. Journal of System Simulation, 2026, 38(1): 136-157.

Corresponding Author

Song Wenfeng

DOI

10.16182/j.issn1004731x.joss.25-0832
