Scaling Motion Generation Models with Million-Level Human Motions


RUC     BAAI     PKU     BeingBeyond

Abstract

Inspired by the recent success of LLMs, the field of human motion understanding has increasingly shifted toward developing large motion models. Despite some progress, current efforts remain far from achieving truly generalist models, primarily due to the lack of massive high-quality data. To address this gap, we present MotionLib, the first million-level dataset for motion generation, which is at least 15× larger than existing counterparts and enriched with hierarchical text descriptions. Using MotionLib, we train a large motion model named Being-M0, demonstrating robust performance across a wide range of human activities, including unseen ones. Through systematic investigation, for the first time, we highlight the importance of scaling both data and model size for advancing motion generation, along with key insights to achieve this goal. To better integrate the motion modality, we propose MotionBook, an innovative motion encoding approach including (1) a compact yet lossless feature to represent motions; (2) a novel 2D lookup-free motion tokenizer that preserves fine-grained motion details while expanding codebook capacity, significantly enhancing the representational power of motion tokens. We believe this work lays the groundwork for developing more versatile and powerful motion generation models in the future.

MotionLib



["Can scaling the large motion model and data benefit motion generation?"] To answer this question, we develop a systematic data collection pipeline to build MotionLib, the first large-scale dataset containing over 1.2M motion sequences -- at least 15× larger than current counterparts. This initiative provides a solid foundation for building robust, universally applicable motion models and offers a comprehensive testbed for future research. Using MotionLib, we conduct a comprehensive investigation into the large motion model. For the first time, we show the scaling law of both data and model size in motion generation, which significantly reduces joint prediction errors while improving generalization to novel motions.





[Comparison with existing human motion datasets] In the table, B, H, and F refer to body, hand, and face, respectively. "hier" indicates that the text captions include hierarchical descriptions of motions, while "body" means the descriptions are not as detailed. "multi" and "single" specify whether the dataset contains multi-person scenarios or only single-person data. As the largest motion generation dataset and benchmark to date, MotionLib features at least 15× more motion and text data than previous datasets, along with additional modalities.





[Data Distribution] MotionLib comprises over 1.2 million motion sequences collected from various public datasets and web videos. A significant portion of MotionLib is derived from open-source, human-related datasets, such as 698.5K motions from Kinetics-700 and 137.5K motions from NTU-RGBD-120. Additionally, MotionLib integrates motions from other established datasets, including BEDLAM and GTA-Human. MotionLib also includes subsets of the Motion-X collection, covering a diverse range of categories such as Animation, Perform, Dance, AIST, Kungfu, GRAB, Music, Idea400, HAA500, Game Motion, and Fitness. It is worth noting that the Motion-X subsets constitute only a small portion of the overall MotionLib dataset (around 6.7%). The left figure illustrates the scale distribution of motion sequences across the subsets of MotionLib, while the right figure shows the average number of frames in each subset.

Being-M0



[Being-M0] Using MotionLib, we train Being-M0, a large motion model that demonstrates robust performance across a wide range of human activities, including unseen ones. To better integrate the motion modality, we propose MotionBook, an innovative motion encoding approach comprising (1) a compact yet lossless feature representation for motion and (2) a novel 2D lookup-free motion tokenizer that preserves fine-grained motion details while expanding codebook capacity, significantly enhancing the representational power of motion tokens. Being-M0 is trained in two stages. In the first stage, we pre-train a motion VQ-VAE to quantize motion sequences into discrete tokens. In the second stage, we fine-tune an autoregressive language model to predict these motion tokens from text.
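Below is a minimal, self-contained sketch of this two-stage recipe as we read it from the description above; it is not the official implementation. The network sizes, the codebook size, the base checkpoint (meta-llama/Llama-2-13b-hf), and the <motion_*> special-token scheme are illustrative assumptions.

# Sketch of the two-stage pipeline (our reading, not the official code).
# Stage 1: a motion VQ-VAE turns a motion sequence into discrete token ids.
# Stage 2: those ids become new vocabulary entries of a pretrained causal LM,
# which is fine-tuned to continue a text prompt with motion tokens.
import torch
import torch.nn as nn
from transformers import AutoModelForCausalLM, AutoTokenizer


class MotionVQVAE(nn.Module):
    """Stage 1: quantize motion features (T x motion_dim) into codebook indices."""

    def __init__(self, motion_dim=263, hidden_dim=512, codebook_size=512, downsample=4):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv1d(motion_dim, hidden_dim, kernel_size=downsample, stride=downsample),
            nn.ReLU(),
            nn.Conv1d(hidden_dim, hidden_dim, kernel_size=3, padding=1),
        )
        self.codebook = nn.Embedding(codebook_size, hidden_dim)
        self.decoder = nn.Sequential(
            nn.ConvTranspose1d(hidden_dim, hidden_dim, kernel_size=downsample, stride=downsample),
            nn.ReLU(),
            nn.Conv1d(hidden_dim, motion_dim, kernel_size=3, padding=1),
        )

    def forward(self, motion):                       # motion: (B, T, motion_dim)
        z = self.encoder(motion.transpose(1, 2))     # (B, hidden_dim, T // downsample)
        flat = z.permute(0, 2, 1).reshape(-1, z.size(1))
        ids = torch.cdist(flat, self.codebook.weight).argmin(dim=-1)  # nearest codebook entry
        ids = ids.view(z.size(0), -1)                # (B, T // downsample) motion token ids
        z_q = self.codebook(ids).permute(0, 2, 1)
        z_q = z + (z_q - z).detach()                 # straight-through estimator
        recon = self.decoder(z_q).transpose(1, 2)    # (B, T, motion_dim) reconstruction
        return recon, ids


vqvae = MotionVQVAE()
motion = torch.randn(2, 64, 263)                     # 2 motions, 64 frames, HumanML3D-style features
recon, token_ids = vqvae(motion)                     # token_ids: (2, 16)

# Stage 2: register motion tokens in the LM vocabulary and fine-tune with the usual
# next-token cross-entropy loss. Checkpoint name and token scheme are assumptions.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-13b-hf")
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-13b-hf")
tokenizer.add_tokens([f"<motion_{i}>" for i in range(512)] + ["<motion_start>", "<motion_end>"])
model.resize_token_embeddings(len(tokenizer))
# A training sample then looks like:
#   "a person waves the right hand <motion_start> <motion_12> <motion_87> ... <motion_end>"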





[SOTA performance] We evaluate our model on the widely adopted HumanML3D benchmark and compare it against a variety of SOTA approaches. Being-M0, which uses LLaMA2-13B as the motion decoder, achieves SOTA performance and significantly outperforms other LLM-based methods such as AvatarGPT and the recent MotionGPT-v2. In particular, we observe substantial improvements in key metrics such as R@1, R@3, and MMDist, highlighting our model's ability to generate motion sequences that are better aligned with text descriptions and of higher quality.
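For concreteness, here is a minimal sketch of how these retrieval metrics are commonly computed, assuming text and motion features have already been extracted with the standard HumanML3D text-motion evaluator. Note the official protocol ranks each text within pools of 32 candidate motions; this sketch ranks against the full set for brevity.

# Sketch of R@k and MMDist over pre-extracted text/motion feature pairs.
import torch


def retrieval_metrics(text_emb, motion_emb, ks=(1, 2, 3)):
    """text_emb, motion_emb: (N, D) feature pairs; row i of each comes from the same sample."""
    dist = torch.cdist(text_emb, motion_emb)               # (N, N) pairwise Euclidean distances
    mm_dist = dist.diagonal().mean().item()                # MMDist: each text vs. its own motion
    ranks = dist.argsort(dim=1)                            # motions sorted by proximity per text
    target = torch.arange(text_emb.size(0)).unsqueeze(1)   # ground-truth motion index per text
    r_at_k = {f"R@{k}": (ranks[:, :k] == target).any(dim=1).float().mean().item() for k in ks}
    return {"MMDist": mm_dist, **r_at_k}


print(retrieval_metrics(torch.randn(32, 512), torch.randn(32, 512)))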






[Text2Motion examples] We visualize human motions generated by Being-M0 trained on the MotionLib dataset. The visualizations demonstrate that our model produces motion sequences that closely align with the input texts, highlighting the effectiveness of MotionLib as a training resource.

Animated Characters

Humanoid Robots



[Animation/Robot Motion] Being-M0 generates motions that can be retargeted to arbitrary animated characters and to humanoid robots such as the Unitree H1, enabling them to perform human-like actions without pre-defined execution. Motion models have many potential applications in both simulated and real-world settings. A real-world demo is on the way.

Conclusion

In this paper, we explore how to advance the field of large motion models. To this end, we introduce MotionLib, the first million-level dataset, comprising over 1.2 million motions with hierarchical text descriptions. Building on MotionLib, we present key insights into scaling up both data and model size for large-scale motion training. Furthermore, we propose MotionBook, a novel motion encoding approach designed to maximize the benefits of training on extensive motion data. MotionBook incorporates compact yet lossless features to represent motion and introduces a novel 2D-LFQ motion quantization method that treats each motion sequence as a 2D image, constructing a finite-scale codebook that eliminates the need for token lookups. Leveraging these advancements, we train Being-M0, a large motion model that achieves SOTA results compared to current counterparts.
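As a rough illustration of the lookup-free idea behind 2D-LFQ (a generic LFQ sketch under our assumptions, not the authors' exact tokenizer): each latent channel is binarized by its sign, and the resulting bit pattern itself serves as the token index, so the implicit codebook has 2^num_bits entries without any embedding-table lookup. The 2D encoder that produces the latent grid from a motion "image" is omitted, and shapes are illustrative.

# Sketch of lookup-free quantization (LFQ) on a 2D latent grid.
import torch


def lfq_quantize(z):
    """z: (B, num_bits, H, W) latent grid from a 2D motion encoder (assumed)."""
    # Binarize each channel to {-1, +1} with a straight-through gradient.
    q = torch.where(z > 0, torch.ones_like(z), -torch.ones_like(z))
    q = z + (q - z).detach()

    # Token index = integer encoded by the bit pattern across channels (no codebook lookup).
    bits = (q > 0).long()                                        # (B, num_bits, H, W)
    weights = 2 ** torch.arange(bits.size(1), device=z.device)   # (num_bits,)
    indices = (bits * weights.view(1, -1, 1, 1)).sum(dim=1)      # (B, H, W) in [0, 2^num_bits)
    return q, indices


# Example: a 14-bit LFQ codebook has 2**14 = 16384 entries without storing an embedding table.
latent = torch.randn(2, 14, 8, 16)
quantized, token_ids = lfq_quantize(latent)
print(token_ids.shape, int(token_ids.max()) < 2**14)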

BibTeX

@inproceedings{wang2025scaling,
  title={Scaling Motion Generation Model with Million-Level Human Motions},
  author={Wang, Ye and Zheng, Sipeng and Cao, Bin and Wei, Qianshan and Zeng, Weishuai and Jin, Qin and Lu, Zongqing},
  booktitle={International Conference on Machine Learning (ICML)},
  year={2025}
}