DemoFunGrasp

Universal Dexterous Functional Grasping via Demonstration-Editing Reinforcement Learning

Chuan Mao1   Haoqi Yuan1,2   Ziye Huang1,2   Chaoyi Xu2   Kai Ma1   Zongqing Lu§1,2


1Peking University     2BeingBeyond


§Corresponding Author

Video Gallery

Dish soap bottle

Bouquet

Pan

Toolbox

Kettle

Yellow Teapot

Rasp

Brown Teapot

Wooden Holder

Saucepan

Watering Pitcher

Bowl

Small Bucket

Hand Sanitizer Bottle

Spray Bottle

Abstract

    Reinforcement learning (RL) has achieved great success in dexterous grasping, significantly improving grasp performance and generalization from simulation to the real world. However, fine-grained functional grasping, which is essential for downstream manipulation tasks, remains underexplored and faces several challenges: the complexity of specifying goals and reward functions for functional grasps across diverse objects, the difficulty of multi-task RL exploration, and the challenge of sim-to-real transfer. In this work, we propose DemoFunGrasp for universal dexterous functional grasping. We factorize functional grasping conditions into two complementary components (grasping style and affordance) and integrate them into an RL framework that can learn to grasp any object with any functional grasping condition. To address the multi-task optimization challenge, we leverage a single grasping demonstration and reformulate the RL problem as one-step demonstration editing, substantially enhancing sample efficiency and performance. Experimental results in both simulation and the real world show that DemoFunGrasp generalizes to unseen combinations of objects, affordances, and grasping styles, outperforming baselines in both success rate and functional grasping accuracy. Beyond strong sim-to-real transfer, our system incorporates a vision-language model (VLM) for planning, enabling autonomous instruction-following grasp execution.
[Overview figure]

    DemoFunGrasp is a reinforcement learning framework for universal dexterous functional grasping. The learned policy generalizes to unseen combinations of objects and functional grasping conditions, and achieves zero-shot sim-to-real transfer. For the same object, the policy can produce diverse grasps by adjusting the grasping style and affordance.
[Pipeline figure]

Overview of DemoFunGrasp. The framework consists of four key components:

  1. Demonstration editing: A source demonstration is adapted through end-effector transformation and object-geometry–aware hand style adjustment.
  2. Functional grasping policy learning: An affordance- and style-conditioned one-step RL policy is trained (a minimal interface sketch follows this list).
  3. Vision-based imitation: The learned policy is transferred to RGB observations for closed-loop, vision-based execution.
  4. Real-world deployment: The vision-based policy is guided by a Vision-Language Model (VLM) for autonomous planning and execution.
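
A minimal sketch of the conditioned demonstration-editing step is given below. This is not the authors' implementation: the class and function names, the residual action parameterization, and the dimensions (a 6-D wrist pose and a 16-DoF hand) are assumptions chosen only to illustrate how a one-step policy can edit a source demonstration under an affordance and style condition.

    # Minimal, hypothetical sketch of an affordance- and style-conditioned
    # one-step demonstration-editing policy (names and shapes are assumed).
    from dataclasses import dataclass
    import numpy as np

    @dataclass
    class Demonstration:
        wrist_pose: np.ndarray     # source end-effector pose: xyz + axis-angle rotation, shape (6,)
        finger_joints: np.ndarray  # source hand joint angles, e.g. shape (16,) for a 16-DoF hand

    @dataclass
    class GraspCondition:
        affordance_point: np.ndarray  # target contact region on the object, here a 3-D point
        style_code: np.ndarray        # one-hot (or latent) descriptor of the desired grasping style

    def edit_demonstration(policy, observation, demo, cond):
        """One-step demonstration editing: the policy outputs residuals that are
        applied to the source demonstration instead of predicting a grasp from scratch."""
        action = policy(observation, demo, cond)          # single forward pass (one-step RL)
        wrist_delta, joint_delta = np.split(action, [6])  # first 6 dims edit the wrist pose
        edited_wrist = demo.wrist_pose + wrist_delta      # end-effector transformation (small-residual approximation)
        edited_joints = demo.finger_joints + joint_delta  # object-geometry-aware hand style adjustment
        return edited_wrist, edited_joints

    # Usage with a dummy policy that emits small random residuals, just to show the interface:
    rng = np.random.default_rng(0)

    def dummy_policy(obs, demo, cond):
        return 0.01 * rng.standard_normal(6 + 16)

    demo = Demonstration(wrist_pose=np.zeros(6), finger_joints=np.zeros(16))
    cond = GraspCondition(affordance_point=np.array([0.10, 0.00, 0.20]), style_code=np.eye(4)[0])
    edited_wrist, edited_joints = edit_demonstration(dummy_policy, None, demo, cond)

In the full framework, this one-step editing policy is trained with RL and then transferred to RGB observations via the vision-based imitation stage (step 3).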
[Results figure: DexGraspNet]

Simulation and Real-World Results with Human-Annotated and VLM-Predicted Grasping Conditions.

    Our results show that DemoFunGrasp generalizes effectively to unseen combinations of objects, affordance targets, and grasp styles, achieving consistently high success rates (GSR), affordance accuracy (IAS), and style adherence (ISS) in both simulation and real-world evaluations. When integrated with a vision-language model for autonomous instruction-following grasp execution, our system attains an average real-world success rate of 64%, demonstrating that the proposed framework serves as a reliable and capable execution module within VLM-driven robotic systems.
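
Read as per-episode averages, these metrics can be tallied as in the small, hypothetical helper below; the trial-record field names are assumptions, not the paper's evaluation code.

    # Hypothetical aggregation of the three reported metrics over evaluation episodes.
    # Each trial is a dict with boolean fields 'success', 'affordance_hit', 'style_match'.
    def summarize_trials(trials):
        n = len(trials)
        return {
            "GSR": sum(t["success"] for t in trials) / n,         # success rate
            "IAS": sum(t["affordance_hit"] for t in trials) / n,  # affordance accuracy
            "ISS": sum(t["style_match"] for t in trials) / n,     # style adherence
        }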

BibTeX

      
        @misc{mao2025universaldexterousfunctionalgrasping,
          title={Universal Dexterous Functional Grasping via Demonstration-Editing Reinforcement Learning}, 
          author={Chuan Mao and Haoqi Yuan and Ziye Huang and Chaoyi Xu and Kai Ma and Zongqing Lu},
          year={2025},
          eprint={2512.13380},
          archivePrefix={arXiv},
          primaryClass={cs.RO},
          url={https://arxiv.org/abs/2512.13380}, 
        }