Many Vision-Language-Action (VLA) models rely on an internal world model trained via next-frame prediction. This approach, however, struggles with physical reasoning because it entangles static appearance with dynamic motion, often resulting in implausible visual forecasts and inefficient policy learning. To address these limitations, we introduce the Visual Chain of Thought (Visual CoT): a pre-training framework that encourages a model to reason about how a scene evolves before predicting what it will look like. We instantiate this principle in FlowVLA, which predicts a future frame (v_{t+1}) only after generating an intermediate optical flow representation (f_t) that encodes motion dynamics. This "v_t → f_t → v_{t+1}" reasoning process is implemented within a single autoregressive Transformer, guiding the model to learn disentangled dynamics. As a result, FlowVLA produces coherent visual predictions and facilitates more efficient policy learning. Experiments on challenging robotic manipulation benchmarks demonstrate state-of-the-art performance with substantially improved sample efficiency, pointing toward a more principled foundation for world modeling.
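Concretely, the "v_t → f_t → v_{t+1}" chain can be read as a two-step factorization of next-frame prediction, sampling motion before appearance. The sketch below uses our notation, with ℓ standing for the language instruction and v_{≤t} for the frames observed so far; the exact conditioning set follows the paper.

```latex
% Sketch: the v_t -> f_t -> v_{t+1} chain as a two-step factorization,
% sampling the flow (motion) first, then the next frame conditioned on it.
\[
\hat{f}_t \sim p_\theta\!\left(f_t \mid v_{\le t},\, \ell\right),
\qquad
\hat{v}_{t+1} \sim p_\theta\!\left(v_{t+1} \mid \hat{f}_t,\, v_{\le t},\, \ell\right)
\]
```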
FlowVLA implements the Visual CoT principle within a single, unified Transformer. We achieve this with a simple yet powerful design: both RGB frames (appearance) and optical flow fields (motion) are encoded into the same token vocabulary using a shared VQ-GAN. The model is then trained on an interleaved sequence of [frame, flow, frame, flow, ...], explicitly forcing it to predict motion before predicting the next state.
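To make the interleaving concrete, here is a minimal sketch of how such a training sequence could be assembled and supervised. The `vqgan.encode` and `transformer` interfaces are hypothetical stand-ins, and we assume flow fields are rendered into the same input space as RGB frames so the shared codebook can tokenize them; the real pipeline may differ.

```python
import torch

# Hypothetical components standing in for the real FlowVLA modules:
# - vqgan: a shared VQ-GAN whose encode() maps an image-like tensor to a
#   grid of discrete token IDs (flow fields are visualized so the same
#   codebook can tokenize them).
# - transformer: a decoder-only model trained with next-token prediction.

def build_interleaved_sequence(frames, flows, vqgan, bos_id):
    """Interleave [frame_t, flow_t, frame_{t+1}, flow_{t+1}, ...] tokens.

    frames: (T, C, H, W) RGB frames v_1..v_T
    flows:  (T-1, C, H, W) optical flow fields f_1..f_{T-1}, rendered into
            the same input space as the frames.
    Returns a 1-D LongTensor of discrete tokens for autoregressive training.
    """
    tokens = [torch.tensor([bos_id])]
    for t in range(frames.shape[0] - 1):
        frame_tokens = vqgan.encode(frames[t]).flatten()  # appearance of v_t
        flow_tokens = vqgan.encode(flows[t]).flatten()    # motion f_t ("think first")
        tokens += [frame_tokens, flow_tokens]
    tokens.append(vqgan.encode(frames[-1]).flatten())     # final frame v_T
    return torch.cat(tokens)


def next_token_loss(transformer, sequence):
    """Standard left-to-right language-modeling loss over the mixed tokens."""
    logits = transformer(sequence[:-1].unsqueeze(0))      # (1, L-1, vocab)
    return torch.nn.functional.cross_entropy(logits.squeeze(0), sequence[1:])
```

Because the flow tokens for step t appear before the frame tokens for step t+1, a plain next-token loss is enough to enforce the "think in motion first" ordering.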
Our analysis of the challenging Bridge V2 dataset reveals two critical failure modes in the baseline model: physical incoherence and semantic inconsistency. FlowVLA addresses both.
The baseline model (UniVLA, middle) fails to maintain physical plausibility, causing the robotic arm to vanish or the object to move erratically. FlowVLA (bottom) remains stable and coherent.
Here, the baseline's prediction appears visually plausible but fails to follow the language command. FlowVLA correctly interprets the instruction and generates the corresponding action.
FlowVLA establishes a new state-of-the-art on two challenging robotics benchmarks, demonstrating the effectiveness of our approach.
FlowVLA outperforms prior methods across all task suites, with the largest gains in the Long-horizon setting.
FlowVLA shows superior robustness to significant visual domain shifts, especially on "Stack Block".
A key benefit of our Visual CoT is dramatically improved sample efficiency. FlowVLA learns faster and converges to stronger policies, especially when data is scarce. In the low-data regime (right), it achieves a 55% higher success rate than the baseline.
Our design choices are critical for success. Ablations confirm that the full Visual CoT structure, direct supervision on flow, and the interleaved sequence format are all essential: removing any one of these components leads to a significant drop in success rate.
@article{zhong2025flowvla,
author = {Zhong, Zhide and Yan, Haodong and Li, Junfeng and Liu, Xiangchen and Gong, Xin and Song, Wenxuan and Chen, Jiayi and Li, Haoang},
title = {FlowVLA: Thinking in Motion with a Visual Chain of Thought},
journal = {arXiv preprint arXiv:XXXX.XXXXX},
year = {2025}
}