FlowVLA: Thinking in Motion with a Visual Chain of Thought

Zhide Zhong, Haodong Yan, Junfeng Li, Xiangchen Liu, Xin Gong, Wenxuan Song, Jiayi Chen, Haoang Li
HKUST(GZ)
FlowVLA Core Concept

We teach models to first reason about how a scene will move (motion) before predicting what it will look like (appearance). This simple principle unlocks more physically realistic world models and more efficient robot learning.

Abstract

Many Vision-Language-Action (VLA) models rely on an internal world model trained via next-frame prediction. This approach, however, struggles with physical reasoning because it entangles static appearance with dynamic motion, often resulting in implausible visual forecasts and inefficient policy learning. To address these limitations, we introduce the Visual Chain of Thought (Visual CoT): a pre-training framework that encourages a model to reason about how a scene evolves before predicting what it will look like. We instantiate this principle in FlowVLA, which predicts a future frame (v_{t+1}) only after generating an intermediate optical flow representation (f_t) that encodes motion dynamics. This "v_t → f_t → v_{t+1}" reasoning process is implemented within a single autoregressive Transformer, guiding the model to learn disentangled dynamics. As a result, FlowVLA produces coherent visual predictions and facilitates more efficient policy learning. Experiments on challenging robot manipulation benchmarks demonstrate state-of-the-art performance with substantially improved sample efficiency, pointing toward a more principled foundation for world modeling.
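In symbols, one rollout step can be written as the chain-rule factorization below. This is our paraphrase of the abstract, with \ell denoting the language instruction; the exact conditioning used in the paper may differ:

p_\theta(f_t, v_{t+1} \mid v_{\le t}, \ell) \;=\; p_\theta(f_t \mid v_{\le t}, \ell)\, p_\theta(v_{t+1} \mid f_t, v_{\le t}, \ell)

so the flow tokens f_t are decoded first, and the next-frame tokens v_{t+1} are generated conditioned on them.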

Approach

FlowVLA implements the Visual CoT principle within a single, unified Transformer. The design is simple yet powerful: both RGB frames (appearance) and optical flow fields (motion) are encoded into the same token vocabulary by a shared VQ-GAN. The model is then trained on an interleaved sequence of [frame, flow, frame, flow, ...], which explicitly forces it to predict motion before predicting the next visual state.
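A minimal sketch of how such an interleaved sequence could be assembled is shown below. It assumes a dummy quantize() stand-in for the shared VQ-GAN encoder and made-up BOV/BOF marker tokens (neither is from the paper); it illustrates only the sequence ordering, not the actual implementation.

from typing import List, Sequence

CODEBOOK_SIZE = 8192            # assumed VQ-GAN codebook size
TOKENS_PER_IMAGE = 256          # assumed tokens per encoded frame or flow field
BOV = CODEBOOK_SIZE             # "begin frame" marker (hypothetical)
BOF = CODEBOOK_SIZE + 1         # "begin flow" marker (hypothetical)

def quantize(x) -> List[int]:
    # Placeholder for the shared VQ-GAN encoder, which maps either an RGB frame
    # or an optical flow field to discrete codebook indices. Returns dummy
    # indices here so the sketch runs end to end.
    return [i % CODEBOOK_SIZE for i in range(TOKENS_PER_IMAGE)]

def build_visual_cot_sequence(frames: Sequence, flows: Sequence,
                              text_tokens: List[int]) -> List[int]:
    # Interleave [v_0, f_0, v_1, f_1, ..., v_T] after the language tokens, so the
    # autoregressive Transformer must emit motion f_t before the next frame v_{t+1}.
    assert len(frames) == len(flows) + 1, "flow f_t links frame v_t to v_{t+1}"
    seq = list(text_tokens)
    for t, flow in enumerate(flows):
        seq += [BOV] + quantize(frames[t])   # appearance tokens v_t
        seq += [BOF] + quantize(flow)        # motion tokens f_t
    seq += [BOV] + quantize(frames[-1])      # final frame v_T
    return seq

if __name__ == "__main__":
    frames = ["rgb_0", "rgb_1", "rgb_2"]     # stand-ins for H x W x 3 images
    flows = ["flow_0to1", "flow_1to2"]       # stand-ins for H x W x 2 flow fields
    seq = build_visual_cot_sequence(frames, flows, text_tokens=[101, 102])
    print(len(seq))                          # 2 + 5 * (1 + 256) = 1287

Pre-training then reduces to ordinary next-token prediction over such sequences; at inference, the model decodes f_t before emitting v_{t+1}, which realizes the "v_t → f_t → v_{t+1}" reasoning described above.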

FlowVLA Framework Architecture

The FlowVLA Framework: (Left) Pre-training with Visual CoT. (Right) Fine-tuning for robotic control.

World Model Rollouts: Seeing is Believing

Our analysis of rollouts on the challenging Bridge V2 dataset reveals two critical failure modes in the baseline model: physical incoherence and semantic inconsistency. FlowVLA addresses both.

Failure Mode 1: Physical Incoherence

Example of physical incoherence in baseline model

The baseline model (UniVLA, middle) fails to maintain physical plausibility, causing the robotic arm to vanish or the object to move erratically. FlowVLA (bottom) remains stable and coherent.


Failure Mode 2: Semantic Inconsistency

Example of semantic inconsistency in baseline model

Here, the baseline's prediction appears visually plausible but fails to follow the language command. FlowVLA correctly interprets the instruction and generates the corresponding action.

Benchmark Performance

FlowVLA establishes a new state-of-the-art on two challenging robotics benchmarks, demonstrating the effectiveness of our approach.

LIBERO Benchmark

LIBERO Benchmark Results

FlowVLA outperforms prior methods across all task suites, with the largest gains in the Long-horizon setting.


SimplerEnv Benchmark

SimplerEnv Benchmark Results

FlowVLA shows superior robustness to significant visual domain shifts, especially on the "Stack Block" task.

Superior Sample Efficiency

A key benefit of our Visual CoT is dramatically improved sample efficiency: FlowVLA converges faster and reaches higher success rates, especially when data is scarce. In the low-data regime (right), it achieves a 55% higher success rate than the baseline.

Training efficiency comparison

Ablation Studies

The ablations confirm that our design choices are critical: the full Visual CoT structure, direct supervision on flow, and the interleaved sequence format are all essential for FlowVLA's performance, and removing any of these components leads to a significant drop in success rate.

Ablation study results

BibTeX

@article{zhong2025flowvla,
  author    = {Zhong, Zhide and Yan, Haodong and Li, Junfeng and Liu, Xiangchen and Gong, Xin and Song, Wenxuan and Chen, Jiayi and Li, Haoang},
  title     = {FlowVLA: Thinking in Motion with a Visual Chain of Thought},
  journal   = {arXiv preprint arXiv:XXXX.XXXXX},
  year      = {2025}
}