
Journal of System Simulation

Abstract

Abstract: Full-body co-speech gesture generation significantly enhances the interactivity of virtual digital humans, requiring generated gestures not only to align accurately with speech but also to exhibit realistic full-body dynamics. To address the limitations of existing methods, in which Transformer-based approaches often overlook the temporal features of action sequences while diffusion-based approaches inadequately capture the spatial correlations between body parts, a full-body action generation method that integrates diffusion models, Mamba, and attention mechanisms is proposed. The spatial self-attention and temporal state space model (STMamba Layer) is introduced as the core of the denoising network to extract inter-part spatial features and intra-part temporal features, thereby improving action quality and diversity. Body motion sequences are modeled along two dimensions: spatially, rotary relative positional encoding and self-attention capture the spatial correlations among body joints; temporally, Mamba captures intra-part dynamics in the action sequences to improve continuity. Experiments and evaluations on the large-scale audio-text-action dataset BEAT2 demonstrate that the proposed method outperforms state-of-the-art approaches in both fidelity and diversity, while maintaining competitive inference speed despite these performance gains.
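For orientation, below is a minimal PyTorch-style sketch of the layer structure the abstract describes: spatial self-attention across body parts at each frame, followed by a per-part temporal block inside the diffusion denoiser. The class names (STMambaLayer, TemporalMixer), tensor shapes, and the gated depthwise convolution standing in for Mamba are illustrative assumptions, not the authors' implementation; the rotary relative positional encoding and the diffusion timestep/speech conditioning are omitted for brevity.

# Hypothetical sketch of an STMamba-style denoising layer; all names and the
# temporal stand-in for Mamba are assumptions, not the paper's released code.
import torch
import torch.nn as nn


class TemporalMixer(nn.Module):
    """Placeholder for a Mamba-style selective state space block:
    a gated, causal depthwise convolution applied along the frame axis."""
    def __init__(self, dim: int, kernel_size: int = 4):
        super().__init__()
        self.conv = nn.Conv1d(dim, dim, kernel_size, padding=kernel_size - 1, groups=dim)
        self.gate = nn.Linear(dim, dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, dim) for a single body part
        T = x.size(1)
        h = self.conv(x.transpose(1, 2))[:, :, :T].transpose(1, 2)  # trim to keep causality
        return self.proj(h * torch.sigmoid(self.gate(x)))


class STMambaLayer(nn.Module):
    """Spatial self-attention over body parts + temporal mixing within each part."""
    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.spatial_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.temporal = TemporalMixer(dim)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, parts, dim) — latent motion tokens in the denoising network
        B, T, P, D = x.shape
        # Spatial step: attend across body parts independently at each frame.
        s = self.norm1(x.reshape(B * T, P, D))
        attn_out, _ = self.spatial_attn(s, s, s)
        x = x + attn_out.reshape(B, T, P, D)
        # Temporal step: mix frames independently for each body part.
        t = x.permute(0, 2, 1, 3).reshape(B * P, T, D)
        t = self.temporal(self.norm2(t))
        return x + t.reshape(B, P, T, D).permute(0, 2, 1, 3)


# Usage on a toy latent sequence: 2 clips, 64 frames, 5 body parts, 128-dim tokens.
layer = STMambaLayer(dim=128)
tokens = torch.randn(2, 64, 5, 128)
print(layer(tokens).shape)  # torch.Size([2, 64, 5, 128])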

First Page

211

Last Page

224

CLC

TP391.41

Recommended Citation

Zhang Shuozhe, Song Wenfeng, Hou Xia, et al. Full-body Co-speech Gesture Generation Based on Spatial-temporal Enhanced Generation Model[J]. Journal of System Simulation, 2026, 38(1): 211-224.

Corresponding Author

Song Wenfeng

DOI

10.16182/j.issn1004731x.joss.25-0833
