SGAR: Structural Generative Augmentation for 3D Human Motion Retrieval

Citations: 0
Rank: #2081 in NeurIPS 2025 (of 5858 papers)
Authors: 4
Data points: 4

Abstract

3D human motion-text retrieval is essential for accurate motion understanding and targets cross-modal alignment learning. Existing methods typically align global motion-text concepts directly and generalize sub-optimally, because correspondence learning is uncertain when multiple motion concepts are coupled in a single motion/text sequence. We therefore study explicit fine-grained concept decomposition for alignment learning and present a novel framework, Structural Generative Augmentation for 3D Human Motion Retrieval (SGAR), to enable generation-augmented retrieval. Specifically, relying on the strong priors of existing large language model (LLM) assets, we decompose human motions structurally into subtler semantic units, i.e., body parts, for fine-grained motion modeling. Based on this, we develop part-mixture learning to better decouple local motion concept learning, boosting part-level alignment. Moreover, a directional relation alignment strategy that exploits the correspondence between full-body and part motions is incorporated to regularize the feature manifold for better consistency. Extensive experiments on three benchmarks, covering motion-text retrieval as well as recognition and generation applications, demonstrate the superior performance and promising transferability of our method.
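To make the idea of combining full-body and part-level alignment concrete, here is a minimal sketch of retrieval scoring. It is not the SGAR implementation: the abstract gives no formulas, so the function names (`cosine`, `score`, `retrieve`), the `part_weight` parameter, the averaging over body parts, and the toy 2-D embeddings are all assumptions made for illustration. It only shows how part-level similarity can separate two motions that look identical at the global level.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def score(query, motion, part_weight=0.5):
    """Blend full-body similarity with the mean similarity over shared body parts.

    `query` and `motion` are dicts with a "full" embedding and a "parts" dict
    mapping a body-part name to its embedding (a hypothetical layout).
    """
    global_sim = cosine(query["full"], motion["full"])
    shared = sorted(set(query["parts"]) & set(motion["parts"]))
    if not shared:
        return global_sim
    part_sim = sum(
        cosine(query["parts"][p], motion["parts"][p]) for p in shared
    ) / len(shared)
    return (1.0 - part_weight) * global_sim + part_weight * part_sim


def retrieve(query, gallery, part_weight=0.5):
    """Return gallery keys ranked from best to worst match."""
    return sorted(
        gallery,
        key=lambda name: score(query, gallery[name], part_weight),
        reverse=True,
    )


# Toy data: both motions match the query globally, but only "wave" matches per part.
query = {"full": [1.0, 0.0], "parts": {"arms": [1.0, 0.0], "legs": [0.0, 1.0]}}
gallery = {
    "kick": {"full": [1.0, 0.0], "parts": {"arms": [0.0, 1.0], "legs": [1.0, 0.0]}},
    "wave": {"full": [1.0, 0.0], "parts": {"arms": [1.0, 0.0], "legs": [0.0, 1.0]}},
}
ranking = retrieve(query, gallery)
```

In this toy setup the global embeddings alone cannot distinguish the two candidates, while the part-level term ranks "wave" first. A trained system would learn the embeddings rather than hand-code them.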

Citation History

Jan 25, 2026: 0
Jan 27, 2026: 0
Jan 27, 2026: 0
Jan 30, 2026: 0