
Spatial-Aware VLA Pretraining through Visual-Physical Alignment from Human Videos

Yicheng Feng1,3,   Wanpeng Zhang1,3,   Ye Wang2,3,   Hao Luo1,3,   Haoqi Yuan1,3,   Sipeng Zheng,   Zongqing Lu
1PKU   2RUC   3BeingBeyond
Corresponding Author

Overview

VIPA-VLA learns 2D–to–3D visual–physical grounding from human videos with Spatial-Aware VLA Pretraining, enabling robot policies with stronger spatial understanding and generalization.

Framework

VIPA-VLA follows a three-stage training framework designed to endow VLA models with strong 3D spatial understanding before learning robot actions:

1. 3D-Visual-Pretraining: Starting from a VLM backbone, we use human demonstration videos with 3D visual annotations to align 2D visual features with 3D spatial representations through a dual-encoder fusion module.
2. 3D-Action-Pretraining: We further learn 3D motion priors from human hand trajectories, discretizing wrist motion into action tokens that capture how humans physically interact with the world.
3. Robot Post-Training: Finally, the pretrained model is adapted to robot manipulation tasks using robot demonstrations, enabling efficient learning and strong generalization in both simulation and real-world environments.
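
The three stages compose sequentially, each starting from the weights produced by the previous one. Below is a minimal Python sketch of this schedule; the function names and signatures are placeholders for illustration, not the released training code.

# Minimal sketch of the three-stage schedule described above; function names
# and signatures are placeholders, not the released training code.

def stage1_3d_visual_pretraining(vlm_backbone, hand3d_videos):
    """Align 2D visual features with 3D spatial representations on annotated
    human videos via the dual-encoder fusion module."""
    ...

def stage2_3d_action_pretraining(model, hand_trajectories):
    """Learn 3D motion priors by predicting discretized wrist-motion action
    tokens from human hand trajectories."""
    ...

def stage3_robot_post_training(model, robot_demos):
    """Adapt the pretrained model to robot manipulation tasks using robot
    demonstrations."""
    ...

def train_vipa_vla(vlm_backbone, hand3d_videos, hand_trajectories, robot_demos):
    model = stage1_3d_visual_pretraining(vlm_backbone, hand3d_videos)
    model = stage2_3d_action_pretraining(model, hand_trajectories)
    return stage3_robot_post_training(model, robot_demos)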

Dataset


We introduce Hand3D, a large-scale collection of human manipulation videos enriched with 3D visual and action annotations. The dataset captures diverse hand–object–camera interactions across varied scenes and tasks, providing rich 2D–to–3D correspondences that serve as powerful supervision for Spatial-Aware Pretraining.
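
To make the annotation types concrete, here is one plausible way a per-frame record could be organized; the field names and shapes below are assumptions for illustration, not the released Hand3D schema.

from dataclasses import dataclass
import numpy as np

@dataclass
class Hand3DSample:
    """Hypothetical per-frame record; field names and shapes are assumptions
    made for illustration, not the released Hand3D schema."""
    rgb: np.ndarray                 # (H, W, 3) video frame
    instruction: str                # natural-language task description
    camera_intrinsics: np.ndarray   # (3, 3) pinhole intrinsics
    camera_pose: np.ndarray         # (4, 4) camera-to-world extrinsics
    hand_keypoints_2d: np.ndarray   # (21, 2) hand keypoints in image space
    wrist_traj_3d: np.ndarray       # (T, 3) future wrist positions, camera frame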

Architecture


VIPA-VLA adopts a dual-encoder design that integrates a semantic visual encoder with a 3D spatial encoder. A lightweight fusion module aligns and combines their representations through cross-attention, producing spatially grounded features. During post-training, an action query interacts with these fused features, and a flow-matching action head predicts continuous 3D actions conditioned on robot states.
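
The fusion step can be pictured as a cross-attention layer in which semantic tokens query the 3D spatial tokens. The PyTorch sketch below illustrates this; the hidden size, head count, and use of nn.MultiheadAttention are assumptions for illustration, not the paper's exact implementation.

import torch
import torch.nn as nn

class DualEncoderFusion(nn.Module):
    """Cross-attention fusion of semantic and 3D spatial tokens (schematic)."""

    def __init__(self, dim: int = 1024, num_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, semantic_tokens: torch.Tensor, spatial_tokens: torch.Tensor):
        # semantic_tokens: (B, N, dim) from the 2D semantic visual encoder
        # spatial_tokens:  (B, M, dim) from the 3D spatial encoder
        fused, _ = self.cross_attn(
            query=semantic_tokens, key=spatial_tokens, value=spatial_tokens
        )
        # Residual connection keeps the original semantic features and adds
        # the spatially grounded context attended from the 3D encoder.
        return self.norm(semantic_tokens + fused)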

Features

3D Action Prediction

VIPA-VLA can predict 3D wrist trajectories after being pretrained with 3D action annotations (the red lines are ground-truth human trajectories, while the blue lines are wrist trajectories predicted by VIPA-VLA). Across diverse manipulation scenes, the model accurately infers how visual observations correspond to physical motion, producing smooth and task-aligned trajectories. Notably, VIPA-VLA filters out noisy human hand motions and captures the essential manipulation intent, such as approaching objects along feasible paths or grasping tools at appropriate contact points, demonstrating strong spatial grounding and action understanding.
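
As a rough illustration of how continuous wrist motion can be turned into discrete action tokens (and decoded back into a 3D trajectory), the sketch below uniformly bins per-step position deltas; the bin count, delta range, and function names are assumptions, not the tokenizer used in the paper.

import numpy as np

def tokenize_wrist_trajectory(wrist_xyz, num_bins=256, delta_range=0.05):
    """Map a (T, 3) wrist trajectory (meters) to (T-1, 3) discrete token ids
    by uniformly binning per-step position deltas."""
    deltas = np.diff(wrist_xyz, axis=0)                        # per-step motion
    deltas = np.clip(deltas, -delta_range, delta_range)        # bound outliers
    normed = (deltas + delta_range) / (2 * delta_range)        # map to [0, 1]
    return np.round(normed * (num_bins - 1)).astype(np.int64)  # uniform bins

def detokenize_wrist_trajectory(tokens, num_bins=256, delta_range=0.05,
                                start=np.zeros(3)):
    """Invert the binning and integrate the deltas back into a 3D trajectory."""
    deltas = tokens / (num_bins - 1) * (2 * delta_range) - delta_range
    return start + np.cumsum(deltas, axis=0)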

VIPA-VLA vs. Baseline

Real-World Robot Performance

For real-robot execution, VIPA-VLA completes tasks with accurate spatial grounding, reliably localizing objects and following multi-step goals. Baselines trained without Spatial-Aware VLA Pretraining and without our dual-encoder architecture frequently exhibit spatial misalignment, such as misjudging object positions, reaching toward incorrect regions, or producing inconsistent trajectories that reflect poor 2D–3D correspondence.

Citation

@article{feng2025vipa,
  title={Spatial-Aware VLA Pretraining through Visual-Physical Alignment from Human Videos},
  author={Feng, Yicheng and Zhang, Wanpeng and Wang, Ye and Luo, Hao and Yuan, Haoqi and Zheng, Sipeng and Lu, Zongqing},
  journal={arXiv preprint arXiv:2512.13080},
  year={2025}
}