Building autonomous robotic agents that achieve human-level performance on real-world embodied tasks is an ultimate goal of humanoid robot research. Recent work has made significant progress both in high-level cognition with Foundation Models (FMs) and in low-level skill development for humanoid robots. However, directly combining these components often results in poor robustness and efficiency, because errors compound over long-horizon tasks and different modules run at different latencies. We introduce Being-0, a hierarchical agent framework that integrates an FM with a modular skill library. The FM handles high-level cognitive tasks such as instruction understanding, task planning, and reasoning, while the skill library provides stable locomotion and dexterous manipulation for low-level control. To bridge the gap between these levels, we propose a novel Connector module, powered by a lightweight vision-language model (VLM). The Connector enhances the FM’s embodied capabilities by translating language-based plans into actionable skill commands and dynamically coordinating locomotion and manipulation to improve task success. With all components except the FM deployable on low-cost onboard computing devices, Being-0 achieves efficient, real-time performance on a full-sized humanoid robot equipped with dexterous hands and active vision. Extensive experiments in large indoor environments demonstrate Being-0’s effectiveness in solving complex, long-horizon tasks that require challenging navigation and manipulation subtasks.
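The three-level structure described above (FM plans in language, Connector translates each step into a skill command, skill library executes it) can be illustrated with a minimal Python sketch. This is not the Being-0 implementation: every name here (`SkillCommand`, `SkillLibrary`, `foundation_model_plan`, `connector_translate`, `run_task`) is hypothetical, the FM is replaced by a fixed plan, and the VLM-based Connector is replaced by simple keyword rules, purely to show the control flow.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class SkillCommand:
    """A low-level command: which skill to run and on what target."""
    skill: str
    argument: str


class SkillLibrary:
    """Registry of low-level skills (stand-ins for locomotion and manipulation)."""

    def __init__(self) -> None:
        self._skills: Dict[str, Callable[[str], bool]] = {}

    def register(self, name: str, fn: Callable[[str], bool]) -> None:
        self._skills[name] = fn

    def execute(self, command: SkillCommand) -> bool:
        # Each skill returns True on success and False on failure.
        return self._skills[command.skill](command.argument)


def foundation_model_plan(instruction: str) -> List[str]:
    """Hypothetical stand-in for the FM: returns a fixed language plan."""
    return ["go to the table", "grasp the cup"]


def connector_translate(step: str) -> SkillCommand:
    """Hypothetical stand-in for the Connector: keyword rules, not a real VLM."""
    if step.startswith("go to "):
        return SkillCommand("navigate", step[len("go to "):])
    if step.startswith("grasp "):
        return SkillCommand("grasp", step[len("grasp "):])
    raise ValueError(f"no skill for step: {step!r}")


def run_task(instruction: str, skills: SkillLibrary) -> List[SkillCommand]:
    """Run the plan step by step and return the commands that succeeded."""
    executed: List[SkillCommand] = []
    for step in foundation_model_plan(instruction):
        command = connector_translate(step)
        if not skills.execute(command):
            break  # assumption: a full system would replan here instead of stopping
        executed.append(command)
    return executed


if __name__ == "__main__":
    library = SkillLibrary()
    library.register("navigate", lambda target: True)
    library.register("grasp", lambda target: True)
    print(run_task("bring me the cup", library))
```

In this sketch the plan stops at the first failed skill; how a real agent recovers from failures, and how it coordinates locomotion with manipulation, is outside what the sketch models.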
In this work, we introduced Being-0, a hierarchical agent framework that controls a full-sized humanoid robot equipped with dexterous hands and active vision to solve long-horizon embodied tasks. The novel VLM-based Connector module effectively bridges the gap between the high-level Foundation Model and low-level skills, significantly enhancing the performance and efficiency of the humanoid agent. Extensive real-world experiments demonstrate Being-0's strong capabilities in navigation, manipulation, and long-horizon task-solving. The results highlight the effectiveness of the proposed Connector, the adjustment method for coordinating navigation and manipulation, and the use of active vision.
Despite these advances, the current system does not incorporate complex locomotion skills such as crouching, sitting, or jumping. These skills could extend the humanoid's functionality beyond flat-ground settings, enabling tasks such as climbing stairs, working from seated positions, or manipulating objects at varying heights. Adding these capabilities is an important direction for future work. Additionally, while the onboard system is efficient, Being-0 still relies on the comparatively slow Foundation Model for high-level decision-making. Future research could explore lightweight Foundation Models tailored to robotics applications to further improve the system's efficiency.
@article{yuan2025being,
title={Being-0: A Humanoid Robotic Agent with Vision-Language Models and Modular Skills},
author={Yuan, Haoqi and Bai, Yu and Fu, Yuhui and Zhou, Bohan and Feng, Yicheng and Xu, Xinrun and Zhan, Yi and Karlsson, B{\"o}rje F and Lu, Zongqing},
journal={arXiv preprint arXiv:2503.12533},
year={2025}
}