DiG-Flow: Discrepancy-Guided Flow Matching for Robust VLA Models

Wanpeng Zhang1,2,   Ye Wang2,3,   Hao Luo1,2,   Haoqi Yuan1,2,   Yicheng Feng1,2,
Sipeng Zheng2,   Qin Jin3,   Zongqing Lu1,2†
1PKU   2BeingBeyond   3RUC
†Corresponding Author

Overview

DiG-Flow pushes more intelligence toward the foundation model, planning more robust actions for general manipulation.

Framework

DiG-Flow is a plug-and-play module for flow-matching based VLAs that rebalances control between the autoregressive foundation model and the flow expert. It embeds model inputs and flow outputs into a unified discrepancy space and uses this signal to gate the flow path, preventing shortcut transports that bypass the pretrained model and steering the expert toward more general, robust actions. DiG-Flow integrates seamlessly into diverse VLA architectures, including π, GR00T, and Being-H.
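
As a rough illustration of the gating idea, the sketch below embeds foundation-model features and flow-expert actions into one shared space, measures their discrepancy, and turns it into a gate in (0, 1]. The projection matrices `w_obs` and `w_act`, the squared-distance discrepancy, the exponential gate, the temperature `tau`, and the residual update are all assumptions chosen for illustration; they are not taken from the DiG-Flow implementation.

```python
import numpy as np

def discrepancy_gate(obs_feat, act_chunk, w_obs, w_act, tau=1.0):
    """Hypothetical gate: small discrepancy -> gate near 1, large -> gate near 0."""
    # Project both modalities into a shared d-dimensional space (assumed linear maps).
    z_obs = obs_feat @ w_obs    # (batch, d)
    z_act = act_chunk @ w_act   # (batch, d)
    # L2-normalize so the squared distance is bounded in [0, 4].
    z_obs = z_obs / (np.linalg.norm(z_obs, axis=-1, keepdims=True) + 1e-8)
    z_act = z_act / (np.linalg.norm(z_act, axis=-1, keepdims=True) + 1e-8)
    disc = np.sum((z_obs - z_act) ** 2, axis=-1)   # per-sample discrepancy
    gate = np.exp(-disc / tau)                     # monotone decreasing in disc
    return disc, gate

def gated_residual(obs_feat, residual, gate):
    """Scale a residual update to the features by the per-sample gate (assumed form)."""
    return obs_feat + gate[:, None] * residual
```

In this sketch, a sample whose action embedding disagrees strongly with the foundation-model embedding receives a small gate, so the residual path contributes little for that sample.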

Key Concept

The key concept of DiG-Flow is to prevent shortcut transports in flow-matching based VLAs: overly flexible transports can fit post-hoc data by warping inputs straight to targets, bypassing the pretrained foundation model and suppressing its generalization. DiG-Flow introduces a gating mechanism that controls the relative contribution of the foundation knowledge and the flow path, discouraging these shortcuts and producing more general, robust behavior.
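
For context, flow-matching action heads are commonly trained to regress a velocity field along a straight path from noise to the target action. The sketch below shows one common convention for that objective, with an optional per-sample weighting by a gate. The weighting scheme, the function name `flow_matching_loss`, and the `velocity_fn` interface are hypothetical and are not claimed to be the loss used by DiG-Flow.

```python
import numpy as np

def flow_matching_loss(velocity_fn, actions, cond, rng, gate=None):
    """Straight-path flow-matching loss; `gate` is a hypothetical per-sample weight."""
    batch = actions.shape[0]
    noise = rng.normal(size=actions.shape)     # x_0 ~ N(0, I)
    t = rng.uniform(size=(batch, 1))           # one time value per sample
    x_t = (1.0 - t) * noise + t * actions      # linear interpolation from noise to action
    target = actions - noise                   # constant velocity along the straight path
    pred = velocity_fn(x_t, t, cond)           # model's predicted velocity
    per_sample = np.mean((pred - target) ** 2, axis=-1)
    if gate is not None:
        per_sample = gate * per_sample         # assumed weighting, for illustration only
    return per_sample.mean()
```

Here `actions` is a flattened action chunk of shape `(batch, action_dim)`, and `cond` stands in for whatever features the foundation model provides. A shortcut transport in this picture is a `velocity_fn` that reaches low loss while largely ignoring `cond`.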

Features

Robust Control

The benefits of DiG-Flow are particularly pronounced for high-DoF robots. On a complex platform that simultaneously controls the head, body, and dexterous hands, DiG-Flow produces more intelligent control and more stable executions than baselines. Beyond improving success metrics, it remains robust across diverse affordances, unseen objects, and variations in background and lighting.

Long Horizon

DiG-Flow excels on long-horizon tasks. By forcing the foundation model to play a more active role in action generation, it better coordinates the slow reasoning system with the fast controller system, leading to stronger long-horizon performance. Under shifted backgrounds, the advantage of DiG-Flow over standard VLA baselines on long-horizon tasks becomes even more pronounced.

Action Planning

DiG-Flow improves action planning: its generated action chunks exhibit more reasonable behavior, and it can still complete the task even under severe goal occlusion. For example, in the “wipe-whiteboard” task, many baselines lose track of the goal once it is occluded and get stuck endlessly repeating “wipe” actions, whereas DiG-Flow plans more coherently and avoids such degenerate behavior.

Spatial Precision

DiG-Flow enhances spatial precision. Although foundation models provide strong visual backbones, standard flow-matching action heads often collapse to shortcut transports that ignore critical spatial cues and so fail to fully exploit this capability. By correcting these shortcuts, DiG-Flow ties the flow head more tightly to the visual representation, so that the generated actions more faithfully reflect the foundation model's spatial reasoning.

Citation

@article{zhang2025digflow,
  title={DiG-Flow: Discrepancy-Guided Flow Matching for Robust VLA Models},
  author={Zhang, Wanpeng and Wang, Ye and Luo, Hao and Yuan, Haoqi and Feng, Yicheng and Zheng, Sipeng and Jin, Qin and Lu, Zongqing},
  journal={arXiv preprint arXiv:2512.01715},
  year={2025}
}