Highlights:
- Introduces ViMoGen-228K, a dataset of 228,000 diverse human motion samples.
- Presents ViMoGen, a diffusion transformer integrating video generation insights into motion generation.
- Develops ViMoGen-light for efficient motion synthesis without video dependencies.
- Releases MBench, a benchmark for motion quality, prompt fidelity, and generalization.
TLDR:
Jing Lin and colleagues introduce ViMoGen, a framework that transfers insights from video generation into 3D human motion generation to improve its generalization and realism, supported by a new large-scale dataset (ViMoGen-228K) and an evaluation benchmark (MBench).
Recent research in computer vision has reached a major milestone with a paper titled *“The Quest for Generalizable Motion Generation: Data, Model, and Evaluation.”* Authored by Jing Lin, Ruisi Wang, Junzhe Lu, Ziqi Huang, Guorui Song, Ailing Zeng, Xian Liu, Chen Wei, Wanqi Yin, Qingping Sun, Zhongang Cai, Lei Yang, and Ziwei Liu, the study addresses one of the biggest challenges in 3D human motion generation (MoGen): poor generalization across diverse contexts. While generative video models (ViGen) have made remarkable progress in learning human behavior from vast amounts of video data, motion generation models have lagged behind, constrained by limited datasets and less flexible modeling approaches. This new work proposes to bridge that gap by transferring key strengths from ViGen into MoGen through a unified framework spanning data, model, and evaluation.
At the heart of this effort lies the ViMoGen-228K dataset. This massive collection includes 228,000 high-quality human motion samples derived from multiple sources: high-fidelity motion capture (MoCap) systems, semantically annotated web videos, and synthetically generated motions using advanced video generation models. Unlike previous motion datasets, ViMoGen-228K includes rich multimodal annotations such as text-motion pairs and text-video-motion triplets. This design enhances semantic diversity, allowing AI models to better understand natural movements and contextual cues from textual descriptions. The scale and heterogeneity of this dataset make it a cornerstone for developing truly generalizable motion models.
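To make the idea of multimodal annotations more concrete, here is a minimal Python sketch of what one such sample record might look like. The field names, shapes, and source labels are illustrative assumptions, not the dataset's published schema; a text-motion pair simply omits the video path that a text-video-motion triplet carries.

```python
from dataclasses import dataclass
from typing import Optional
import numpy as np

@dataclass
class MotionSample:
    """Hypothetical record for a ViMoGen-228K-style entry (illustrative only)."""
    text: str                         # natural-language description of the motion
    motion: np.ndarray                # e.g. (num_frames, num_joints, 3) joint positions
    source: str                       # e.g. "mocap", "web_video", or "synthetic"
    video_path: Optional[str] = None  # present only for text-video-motion triplets
    fps: float = 30.0

# A text-motion pair from MoCap and a text-video-motion triplet from a web video
pair = MotionSample(
    text="a person jumps forward and lands softly",
    motion=np.zeros((90, 24, 3)),
    source="mocap",
)
triplet = MotionSample(
    text="a dancer spins twice with arms extended",
    motion=np.zeros((120, 24, 3)),
    source="web_video",
    video_path="clips/dance_0421.mp4",  # hypothetical path
)
```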
To exploit this diverse dataset, the authors introduced ViMoGen, a flow-matching-based diffusion transformer that unifies structural priors from MoCap data with the visual priors learned by ViGen models. Using gated multimodal conditioning, ViMoGen can contextually modulate information from the text, video, and motion domains, resulting in highly realistic and context-consistent human motion synthesis. Additionally, the team developed ViMoGen-light, a distilled, lightweight version that retains the generalization ability of the full model while removing direct dependencies on video modules, significantly improving efficiency. Complementing these models, the researchers propose MBench, a hierarchical benchmark for assessing model performance across key dimensions such as motion quality, prompt fidelity, and generalization. Experimental results show that ViMoGen and its variants substantially outperform existing motion generation frameworks in both automatic metrics and human evaluations.
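To illustrate what gated multimodal conditioning can look like in practice, the PyTorch sketch below shows one plausible way per-modality condition embeddings could be gated into a transformer's token stream. The module structure, dimensions, and sigmoid gating form are assumptions for illustration, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class GatedMultimodalConditioning(nn.Module):
    """Minimal sketch of gated multimodal conditioning (illustrative, not the paper's design).

    Text, video, and motion condition embeddings are projected to the token width
    and blended into the sequence through learned sigmoid gates, so the model can
    downweight or ignore a modality depending on context.
    """
    def __init__(self, dim: int = 512):
        super().__init__()
        modalities = ("text", "video", "motion")
        self.proj = nn.ModuleDict({m: nn.Linear(dim, dim) for m in modalities})
        self.gate = nn.ModuleDict({
            m: nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid()) for m in modalities
        })

    def forward(self, tokens: torch.Tensor, conds: dict) -> torch.Tensor:
        # tokens: (batch, seq_len, dim); conds: per-modality (batch, dim) embeddings.
        # Modalities absent from `conds` (e.g. no video at inference) contribute nothing,
        # which loosely mirrors how a video-free variant could drop that pathway.
        for name, c in conds.items():
            c = c.unsqueeze(1)  # broadcast the condition over the sequence dimension
            tokens = tokens + self.gate[name](c) * self.proj[name](c)
        return tokens

# Example: condition noisy motion tokens on a text embedding only
block = GatedMultimodalConditioning(dim=512)
x = torch.randn(2, 64, 512)                     # (batch, frames, dim) motion tokens
out = block(x, {"text": torch.randn(2, 512)})   # video/motion conditions omitted
```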
This research represents a significant step towards general-purpose human motion synthesis, paving the way for more adaptable virtual humans, realistic animations, and advanced interactions in gaming, robotics, and mixed reality applications. By combining insights from the success of video generation with innovations in motion synthesis, the paper sets a new standard for data-driven human modeling and opens up new directions for future generative AI research.
Source:
Original research paper: Lin, J., Wang, R., Lu, J., Huang, Z., Song, G., Zeng, A., Liu, X., Wei, C., Yin, W., Sun, Q., Cai, Z., Yang, L., & Liu, Z. (2025). *The Quest for Generalizable Motion Generation: Data, Model, and Evaluation.* arXiv:2510.26794 [cs.CV]. DOI: https://doi.org/10.48550/arXiv.2510.26794
