DiG-Flow is a plug-and-play module for flow-matching-based vision-language-action models (VLAs) that rebalances control between the autoregressive foundation model and the flow expert. It embeds model inputs and flow outputs into a unified discrepancy space and uses the resulting signal to gate the flow path, preventing shortcut transports that bypass the pretrained model and steering the expert toward more general, robust actions. DiG-Flow integrates into diverse VLA architectures, including π, GR00T, and Being-H.
The key idea behind DiG-Flow is to prevent shortcut transports in flow-matching-based VLAs. An overly flexible transport can fit post-hoc data by warping inputs straight to targets, which bypasses the pretrained foundation model and suppresses its generalization. DiG-Flow introduces a gating mechanism that controls the relative contribution of the foundation model's knowledge and the flow path, discouraging these shortcuts and producing more general, robust behavior.
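To make the gating idea concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the DiG-Flow implementation: the function names (`discrepancy_gate`, `gate_flow_update`), the projection matrices `W_obs` and `W_act`, the mean-embedding discrepancy, the exponential gate, and the parameters `tau` and `g_min` are all hypothetical choices made for this example. The sketch also assumes that a larger discrepancy should shrink the flow path's contribution.

```python
import numpy as np

def discrepancy_gate(obs_feats, act_feats, W_obs, W_act, tau=1.0, g_min=0.1):
    """Hypothetical gate computed from an input/output discrepancy.

    obs_feats: (n, d_obs) features of the model inputs (assumed shape).
    act_feats: (m, d_act) features of the flow outputs (assumed shape).
    W_obs, W_act: projections into a shared d-dimensional space (assumed).
    """
    z_obs = obs_feats @ W_obs  # (n, d) input embeddings in the shared space
    z_act = act_feats @ W_act  # (m, d) output embeddings in the shared space

    # Assumed discrepancy: squared distance between the two mean embeddings.
    # The actual method may use a different measure.
    diff = z_obs.mean(axis=0) - z_act.mean(axis=0)
    d = float(diff @ diff)

    # Assumed gate: 1.0 at zero discrepancy, decaying toward g_min as d grows.
    gate = g_min + (1.0 - g_min) * np.exp(-tau * d)
    return gate, d

def gate_flow_update(flow_update, gate):
    """Scale the flow path's contribution by the scalar gate."""
    return gate * flow_update

# Usage with random placeholder features.
rng = np.random.default_rng(0)
obs = rng.normal(size=(16, 32))        # e.g. 16 input tokens, 32 dims each
act = rng.normal(size=(8, 7))          # e.g. 8-step action chunk, 7 DoF
W_obs = rng.normal(size=(32, 4)) * 0.1
W_act = rng.normal(size=(7, 4)) * 0.1

gate, d = discrepancy_gate(obs, act, W_obs, W_act)
gated = gate_flow_update(rng.normal(size=(8, 7)), gate)
print(f"discrepancy={d:.4f}, gate={gate:.4f}, gated shape={gated.shape}")
```

The lower bound `g_min` is one way to keep the flow path from being switched off entirely; whether and how the gate is bounded in practice is a design choice this sketch does not settle.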
The benefits of DiG-Flow are particularly pronounced for high-DoF robots. On a complex platform that simultaneously controls the head, body, and dexterous hands, DiG-Flow produces more intelligent control and more stable executions than baselines. Beyond improving success metrics, it remains robust across diverse affordances, unseen objects, and variations in background and lighting.
DiG-Flow excels on long-horizon tasks. By forcing the foundation model to play a more active role in action generation, it better coordinates the slow reasoning system with the fast controller system, leading to stronger long-horizon performance. Under shifted backgrounds, the advantage of DiG-Flow over standard VLA baselines on long-horizon tasks becomes even more pronounced.
DiG-Flow improves action planning: the action chunks it generates are more coherent, and it can still complete the task even under severe goal occlusion. For example, in the “wipe-whiteboard” task, many baselines lose track of the goal once it is occluded and get stuck repeating “wipe” actions indefinitely, whereas DiG-Flow plans more coherently and avoids such degenerate behavior.
DiG-Flow enhances spatial precision. While foundation models provide strong visual backbones, standard flow-matching action heads often collapse to shortcut transports that ignore critical spatial cues and fail to fully exploit this capability. By correcting these shortcuts, DiG-Flow ties the flow head more tightly to the visual representation, so that the generated actions more faithfully reflect the foundation model's spatial reasoning.