VIPA-VLA follows a three-stage training framework designed to endow VLA models with strong 3D spatial understanding before learning robot actions: (1) 3D-Visual-Pretraining: Starting from a VLM backbone, we use human demonstration videos with 3D visual annotations to align 2D visual features with 3D spatial representations through a dual-encoder fusion module. (2) 3D-Action-Pretraining: We further learn 3D motion priors from human hand trajectories, discretizing wrist motion into action tokens that capture how humans interact physically with the world. (3) Robot Post-Training: Finally, the pretrained model is adapted to robot manipulation tasks using robot demonstrations, enabling efficient learning and strong generalization in both simulation and real-world environments.
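The action discretization in stage (2) can be illustrated with a minimal sketch. This is not the paper's exact tokenizer: the bin count, displacement range, and per-axis binning scheme below are illustrative assumptions, and `discretize_wrist_motion` is a hypothetical helper name.

```python
import numpy as np

def discretize_wrist_motion(wrist_xyz, num_bins=256, delta_range=0.05):
    """Illustrative tokenization of a 3D wrist trajectory into discrete action tokens.

    wrist_xyz: (T, 3) array of wrist positions in meters.
    Each per-step displacement is clipped to [-delta_range, delta_range] and
    mapped to one of num_bins uniform bins per axis (values are assumptions).
    """
    deltas = np.diff(wrist_xyz, axis=0)                      # (T-1, 3) per-step wrist motion
    deltas = np.clip(deltas, -delta_range, delta_range)      # bound noisy outlier motions
    normalized = (deltas + delta_range) / (2 * delta_range)  # map each axis to [0, 1]
    tokens = np.minimum((normalized * num_bins).astype(int), num_bins - 1)
    return tokens                                            # (T-1, 3) integer action tokens

# Example: a short synthetic trajectory moving 1 cm per step along x.
traj = np.cumsum(np.tile([0.01, 0.0, 0.0], (10, 1)), axis=0)
print(discretize_wrist_motion(traj)[:3])
```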
We introduce Hand3D, a large-scale collection of human manipulation videos enriched with 3D visual and action annotations. The dataset captures diverse hand-object-camera interactions across varied scenes and tasks, providing rich 2D-to-3D correspondences that serve as powerful supervision for Spatial-Aware Pretraining.
VIPA-VLA adopts a dual-encoder design that integrates a semantic visual encoder with a 3D spatial encoder. A lightweight fusion module aligns and combines their representations through cross-attention, producing spatially grounded features. During post-training, an action query interacts with these fused features, and a flow-matching action head predicts continuous 3D actions conditioned on robot states.
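A minimal sketch of how such a cross-attention fusion module and flow-matching action head might look. The module names, dimensions, residual design, and training target in the comments are our assumptions for illustration, not the released implementation.

```python
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    """Hypothetical fusion module: semantic tokens attend to 3D spatial tokens."""
    def __init__(self, dim=768, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, semantic_tokens, spatial_tokens):
        # Queries come from the 2D semantic encoder; keys/values from the 3D spatial encoder.
        fused, _ = self.attn(semantic_tokens, spatial_tokens, spatial_tokens)
        return self.norm(semantic_tokens + fused)  # residual combination of the two streams

class FlowMatchingActionHead(nn.Module):
    """Hypothetical flow-matching head: predicts a velocity field over actions."""
    def __init__(self, dim=768, action_dim=7, state_dim=14):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim + action_dim + state_dim + 1, dim),
            nn.GELU(),
            nn.Linear(dim, action_dim),
        )

    def forward(self, fused_feat, noisy_action, robot_state, t):
        # fused_feat: pooled output of the action query; t: flow time in [0, 1].
        # Assumed training target (rectified-flow style): with a_t = (1 - t) * noise + t * action,
        # the head regresses the velocity (action - noise) that transports noise to the action.
        x = torch.cat([fused_feat, noisy_action, robot_state, t], dim=-1)
        return self.net(x)
```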
After pretraining with 3D action annotations, VIPA-VLA can predict 3D wrist trajectories (red lines show ground-truth human trajectories; blue lines show wrist trajectories predicted by VIPA-VLA). Across diverse manipulation scenes, the model accurately infers how visual observations correspond to physical motion, producing smooth and task-aligned trajectories. Notably, VIPA-VLA filters out noisy human hand motions and captures the essential manipulation intent, such as approaching objects along feasible paths or grasping tools at appropriate contact points, demonstrating strong spatial grounding and action understanding.
For real-robot execution, VIPA-VLA completes tasks with accurate spatial grounding, reliably localizing objects and following multi-step goals. Baselines trained without Spatial-Aware Pretraining and without our dual-encoder architecture frequently exhibit spatial misalignment, such as misjudging object positions, reaching toward incorrect regions, or producing inconsistent trajectories that reflect poor 2D-to-3D correspondence.