Many Vision-Language-Action (VLA) models rely on an internal world model trained via next-frame prediction. This approach, however, struggles with physical reasoning because it entangles static appearance with dynamic motion, often resulting in implausible visual forecasts and inefficient policy learning. To address these limitations, we introduce the Visual Chain of Thought (Visual CoT): a pre-training framework that encourages a model to reason about how a scene evolves before predicting what it will look like. We instantiate this principle in FlowVLA, which predicts a future frame (v_{t+1}) only after generating an intermediate optical flow representation (f_t) that encodes motion dynamics. This "v_t → f_t → v_{t+1}" reasoning process is implemented within a single autoregressive Transformer, guiding the model to learn disentangled dynamics. As a result, FlowVLA produces coherent visual predictions and facilitates more efficient policy learning. Experiments on challenging robotic manipulation benchmarks demonstrate state-of-the-art performance with substantially improved sample efficiency, pointing toward a more principled foundation for world modeling.
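Concretely, the "v_t → f_t → v_{t+1}" chain can be read as a two-step factorization of next-frame prediction, sampling motion before appearance. The sketch below uses our notation, with ℓ standing for the language instruction and v_{≤t} for the frames observed so far; the exact conditioning set follows the paper.

```latex
% Sketch: the v_t -> f_t -> v_{t+1} chain as a two-step factorization,
% sampling the flow (motion) first, then the next frame conditioned on it.
\[
\hat{f}_t \sim p_\theta\!\left(f_t \mid v_{\le t},\, \ell\right),
\qquad
\hat{v}_{t+1} \sim p_\theta\!\left(v_{t+1} \mid \hat{f}_t,\, v_{\le t},\, \ell\right)
\]
```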
FlowVLA implements the Visual CoT principle within a single, unified Transformer. We achieve this with a simple yet powerful design: both RGB frames (appearance) and optical flow fields (motion) are encoded into the same token vocabulary using a shared VQ-GAN. The model is then trained on an interleaved sequence of [frame, flow, frame, flow, ...], explicitly forcing it to predict motion before predicting the next state.
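To make the interleaving concrete, here is a minimal sketch of how such a training sequence could be assembled and supervised. The `vqgan.encode` and `transformer` interfaces are hypothetical stand-ins, and we assume flow fields are rendered into the same input space as RGB frames so the shared codebook can tokenize them; the real pipeline may differ.

```python
import torch

# Hypothetical components standing in for the real FlowVLA modules:
# - vqgan: a shared VQ-GAN whose encode() maps an image-like tensor to a
#   grid of discrete token IDs (flow fields are visualized so the same
#   codebook can tokenize them).
# - transformer: a decoder-only model trained with next-token prediction.

def build_interleaved_sequence(frames, flows, vqgan, bos_id):
    """Interleave [frame_t, flow_t, frame_{t+1}, flow_{t+1}, ...] tokens.

    frames: (T, C, H, W) RGB frames v_1..v_T
    flows:  (T-1, C, H, W) optical flow fields f_1..f_{T-1}, rendered into
            the same input space as the frames.
    Returns a 1-D LongTensor of discrete tokens for autoregressive training.
    """
    tokens = [torch.tensor([bos_id])]
    for t in range(frames.shape[0] - 1):
        frame_tokens = vqgan.encode(frames[t]).flatten()  # appearance of v_t
        flow_tokens = vqgan.encode(flows[t]).flatten()    # motion f_t ("think first")
        tokens += [frame_tokens, flow_tokens]
    tokens.append(vqgan.encode(frames[-1]).flatten())     # final frame v_T
    return torch.cat(tokens)


def next_token_loss(transformer, sequence):
    """Standard left-to-right language-modeling loss over the mixed tokens."""
    logits = transformer(sequence[:-1].unsqueeze(0))      # (1, L-1, vocab)
    return torch.nn.functional.cross_entropy(logits.squeeze(0), sequence[1:])
```

Because the flow tokens for step t appear before the frame tokens for step t+1, a plain next-token loss is enough to enforce the "think in motion first" ordering.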
Our analysis of the challenging Bridge V2 dataset reveals two critical failure modes in the baseline model: physical incoherence and semantic inconsistency. FlowVLA addresses both.
The baseline model (UniVLA, middle) fails to maintain physical plausibility, causing the robotic arm to vanish or the object to move erratically. FlowVLA (bottom) remains stable and coherent.
Here, the baseline's prediction appears visually plausible but fails to follow the language command. FlowVLA correctly interprets the instruction and generates the corresponding action.
FlowVLA establishes a new state-of-the-art on two challenging robotics benchmarks, demonstrating the effectiveness of our approach.
FlowVLA outperforms prior methods across all task suites, with the largest gains in the Long-horizon setting.
FlowVLA shows superior robustness to significant visual domain shifts, especially on "Stack Block".
A key benefit of our Visual CoT is dramatically improved sample efficiency. FlowVLA learns faster and converges to stronger policies, especially when data is scarce. In the low-data regime (right), it achieves a 55% higher success rate than the baseline.
Our design choices are critical for success. Ablations confirm that the full Visual CoT structure, direct supervision on flow, and the interleaved sequence format are all essential: removing any one of these components leads to a significant drop in success rate.
@article{zhong2025flowvla,
author = {Zhong, Zhide and Yan, Haodong and Li, Junfeng and Liu, Xiangchen and Gong, Xin and Song, Wenxuan and Chen, Jiayi and Li, Haoang},
title = {FlowVLA: Thinking in Motion with a Visual Chain of Thought},
journal = {arXiv preprint arXiv:XXXX.XXXXX},
year = {2025}
}