World-Value-Action Model: Implicit Planning for Vision-Language-Action Systems
Abstract: Vision-Language-Action (VLA) models have emerged as a promising paradigm for building embodied agents that ground perception and language into action. However, most existing approaches rely on direct action prediction and lack the ability to reason over long-horizon trajectories and evaluate their consequences, which limits performance in complex decision-making tasks. In this work, we introduce the World-Value-Action (WAV) model, a unified framework that enables implicit planning in VLA systems. Rather than performing explicit trajectory optimization, the WAV model learns a structured latent representation of future trajectories conditioned on visual observations and language instructions. A learned world model predicts future states, while a trajectory value function evaluates their long-horizon utility. Action generation is then formulated as inference in this latent space, where the model progressively concentrates probability mass on high-value, dynamically feasible trajectories. We provide a theoretical perspective showing that planning directly in action space suffers from exponential decay in the probability of sampling a feasible trajectory as the horizon increases. In contrast, latent-space inference reshapes the search distribution toward feasible regions, enabling efficient long-horizon decision making. Extensive simulation and real-world experiments demonstrate that the WAV model consistently outperforms state-of-the-art methods, achieving significant improvements in task success rate, generalization, and robustness, especially in long-horizon and compositional scenarios.

Subjects: Robotics (cs.RO); Machine Learning (cs.LG)
Cite as: arXiv:2604.14732 [cs.RO] (or arXiv:2604.14732v1 [cs.RO] for this version), https://doi.org/10.48550/arXiv.2604.14732
Submission history: From: Runze Li [v1] Thu, 16 Apr 2026 07:46:05 UTC (6,871 KB)
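The abstract's theoretical claim that action-space planning degrades exponentially with horizon can be illustrated with a minimal sketch. It assumes the simplest possible model, not anything from the paper: each sampled action is independently feasible with probability p, so a length-H action sequence survives with probability p**H. The function names, the value p = 0.9, and the horizons below are all illustrative assumptions.

```python
import random

def feasible_trajectory_prob(p: float, horizon: int) -> float:
    """Probability that all `horizon` i.i.d. steps are feasible: p ** horizon."""
    return p ** horizon

def estimate_by_sampling(p: float, horizon: int, n_samples: int = 100_000,
                         seed: int = 0) -> float:
    """Monte Carlo check: fraction of sampled action sequences that stay feasible.

    This mimics naive action-space sampling; most samples are wasted on
    infeasible trajectories once the horizon grows.
    """
    rng = random.Random(seed)
    hits = sum(
        all(rng.random() < p for _ in range(horizon))
        for _ in range(n_samples)
    )
    return hits / n_samples

if __name__ == "__main__":
    p = 0.9  # assumed per-step feasibility under naive action-space sampling
    for horizon in (5, 20, 50):
        # Exponential decay: even a 90% per-step rate collapses at long horizons.
        print(horizon, feasible_trajectory_prob(p, horizon))
```

Under this toy model, even a 90% per-step feasibility rate leaves roughly a 0.5% chance of a fully feasible 50-step trajectory, which is the motivation the abstract gives for reshaping the search distribution in latent space rather than sampling raw actions.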