Highlights:
- New framework bridges video generation (ViGen) and motion generation (MoGen) to improve human motion modeling.
- Introduction of ViMoGen-228K, a large-scale dataset with 228,000 high-quality text-motion and text-video-motion samples.
- ViMoGen: a diffusion transformer using multimodal conditioning for robust and generalizable motion synthesis.
- Creation of MBench, a hierarchical benchmark for evaluating motion quality, prompt fidelity, and generalization.
TLDR:
A research team led by Jing Lin and colleagues introduces ViMoGen, a novel AI framework that leverages insights from video generation to advance 3D human motion generation. The system improves generalization, efficiency, and evaluation through a new dataset, model, and benchmark, marking a major step toward motion generation that holds up beyond its training data.
Human motion generation (MoGen) stands as a pivotal field within computer vision and artificial intelligence, influencing areas ranging from animation and robotics to virtual reality. Yet, despite rapid progress, existing MoGen systems often struggle to generalize beyond their training datasets. Addressing this challenge, a research team of Jing Lin, Ruisi Wang, Junzhe Lu, Ziqi Huang, Guorui Song, Ailing Zeng, Xian Liu, Chen Wei, Wanqi Yin, Qingping Sun, Zhongang Cai, Lei Yang, and Ziwei Liu has released a groundbreaking study titled ‘The Quest for Generalizable Motion Generation: Data, Model, and Evaluation.’ This work introduces an end-to-end framework that strategically transfers knowledge from the more mature domain of video generation (ViGen) to advance motion synthesis capabilities.
At the core of this advancement is the ViMoGen-228K dataset — a comprehensive collection of 228,000 annotated 3D motion sequences integrating optical motion-capture (MoCap) data, annotated web video motions, and high-quality samples synthesized from state-of-the-art ViGen models. The inclusion of both text-motion pairs and text-video-motion triplets significantly enhances the dataset’s semantic richness, making it ideal for training models that understand not just how humans move, but why and when they do so. This multimodal dataset forms the backbone for the team’s novel generative model.
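To make the dataset's structure concrete, here is a minimal sketch of what a single record might look like. The field names, array shapes, and source labels below are illustrative assumptions for exposition, not the actual ViMoGen-228K schema.

```python
from dataclasses import dataclass
from typing import Optional
import numpy as np

@dataclass
class MotionSample:
    """One hypothetical record: a text prompt paired with a 3D motion clip.

    `motion` is assumed here to be a (num_frames, num_joints, 3) array of joint
    positions; the real ViMoGen-228K format may differ (e.g. body-model parameters).
    """
    prompt: str                       # natural-language description of the motion
    motion: np.ndarray                # (T, J, 3) joint trajectories
    source: str                       # "mocap", "web_video", or "vigen_synthetic" (assumed labels)
    video_path: Optional[str] = None  # present only for text-video-motion triplets

# Text-motion pair, e.g. from optical MoCap
pair = MotionSample(
    prompt="a person waves with the right hand",
    motion=np.zeros((120, 24, 3)),
    source="mocap",
)

# Text-video-motion triplet, e.g. motion recovered from an annotated web video
triplet = MotionSample(
    prompt="a person jumps over a puddle",
    motion=np.zeros((90, 24, 3)),
    source="web_video",
    video_path="clips/puddle_jump.mp4",  # hypothetical path
)
```

Text-motion pairs omit the video reference, while text-video-motion triplets keep a pointer back to the source clip, which is what gives the dataset its extra semantic grounding.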
The researchers propose ViMoGen, a diffusion transformer model built on a flow-matching paradigm, designed to merge the strengths of real MoCap priors with synthetic ViGen priors through gated multimodal conditioning. This design allows the model to generate realistic, contextually accurate motion sequences from diverse textual or visual prompts. To further improve computational efficiency, the team also developed ViMoGen-light, a streamlined variant that retains generalization capabilities while removing direct dependencies on video generation pipelines.

Complementing these innovations, the team introduces MBench, a hierarchical benchmark to systematically evaluate motion quality, prompt fidelity, and generalization. Empirical results confirm that ViMoGen significantly outperforms existing MoGen methods across both automated metrics and human evaluations. Combined, these contributions pave the way for more adaptable, scalable, and realistic motion generation systems applicable across entertainment, simulation, and embodied AI domains.
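As a rough illustration of the two ingredients named above, gated multimodal conditioning and a flow-matching training objective, here is a minimal PyTorch sketch. The module names, dimensions, and gating formula are assumptions made for exposition; the paper's diffusion transformer is far larger, and its exact conditioning mechanism is not reproduced here.

```python
import torch
import torch.nn as nn

class GatedMultimodalConditioner(nn.Module):
    """Fuse text and video-prior embeddings with learned sigmoid gates.

    A sketch of the gated-conditioning idea: each modality is projected to a
    shared width and scaled by a gate, so the model can lean on MoCap-grounded
    text cues, synthetic ViGen cues, or both. ViMoGen's actual gating may differ.
    """
    def __init__(self, text_dim: int, video_dim: int, dim: int):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, dim)
        self.video_proj = nn.Linear(video_dim, dim)
        self.gate = nn.Sequential(nn.Linear(2 * dim, 2), nn.Sigmoid())

    def forward(self, text_emb, video_emb):
        t, v = self.text_proj(text_emb), self.video_proj(video_emb)
        g = self.gate(torch.cat([t, v], dim=-1))   # (B, 2) per-modality gates
        return g[:, :1] * t + g[:, 1:] * v         # gated fusion

class MotionVelocityModel(nn.Module):
    """Toy stand-in for the diffusion transformer: predicts the flow-matching velocity."""
    def __init__(self, motion_dim: int, cond_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(motion_dim + cond_dim + 1, hidden), nn.SiLU(),
            nn.Linear(hidden, motion_dim),
        )

    def forward(self, x_t, t, cond):
        return self.net(torch.cat([x_t, cond, t], dim=-1))

def flow_matching_loss(model, cond, x1):
    """One conditional flow-matching training step on flattened motion vectors x1."""
    x0 = torch.randn_like(x1)              # noise endpoint
    t = torch.rand(x1.size(0), 1)          # random time in [0, 1]
    x_t = (1 - t) * x0 + t * x1            # point on the linear interpolation path
    target_v = x1 - x0                     # constant target velocity along the path
    pred_v = model(x_t, t, cond)
    return torch.mean((pred_v - target_v) ** 2)

# Toy usage with random tensors (all dimensions are illustrative only)
conditioner = GatedMultimodalConditioner(text_dim=512, video_dim=768, dim=256)
model = MotionVelocityModel(motion_dim=72, cond_dim=256)
cond = conditioner(torch.randn(8, 512), torch.randn(8, 768))
loss = flow_matching_loss(model, cond, torch.randn(8, 72))
loss.backward()
```

The sigmoid gates let the fused conditioning lean more heavily on whichever modality is informative for a given prompt, which is the intuition behind blending real MoCap priors with synthetic ViGen priors.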
Source:
arXiv:2510.26794v1 [cs.CV] — DOI: https://doi.org/10.48550/arXiv.2510.26794 (Submitted on 30 Oct 2025)
