Robotics · VLA · Post-Training

VLA-OPD: Bridging Offline SFT and Online RL for Vision-Language-Action Models via On-Policy Distillation

A unified post-training framework that combines the sample efficiency of supervised fine-tuning with the robustness of reinforcement learning through on-policy distillation with Reverse-KL optimization.

Zhide Zhong1 Haodong Yan1 Junfeng Li1 Junjie He1 Tianran Zhang1 Haoang Li1

1 The Hong Kong University of Science and Technology (Guangzhou)

📜 Abstract

Although pre-trained Vision-Language-Action (VLA) models exhibit impressive generalization in robotic manipulation, post-training remains crucial to ensure reliable performance during deployment. However, standard offline Supervised Fine-Tuning (SFT) suffers from distribution shifts and catastrophic forgetting of pre-trained capabilities, while online Reinforcement Learning (RL) struggles with sparse rewards and poor sample efficiency. In this paper, we propose On-Policy VLA Distillation (VLA-OPD), a framework bridging the efficiency of SFT with the robustness of RL. Instead of relying on sparse environmental rewards, VLA-OPD leverages an expert teacher to provide dense, token-level supervision on the student's self-generated trajectories. This enables active error correction on policy-induced states while preserving pre-trained general capabilities through gentle alignment. Crucially, we formulate VLA-OPD via a Reverse-KL objective. Unlike standard Forward-KL that induces mode-covering entropy explosion, or Hard-CE that causes premature entropy collapse, our bounded mode-seeking objective ensures stable policy learning by filtering out the teacher's epistemic uncertainty while maintaining action diversity. Experiments on LIBERO and RoboTwin2.0 benchmarks demonstrate that VLA-OPD significantly improves sample efficiency over RL and robustness over SFT, while effectively mitigating catastrophic forgetting during post-training.

Vision-Language-Action Models Post-Training On-Policy Distillation Reverse-KL Divergence Robotic Manipulation

Key Contributions

🔗

Unified Post-Training Framework

VLA-OPD bridges SFT and RL by leveraging dense, token-level supervision on self-generated trajectories, resolving the exposure bias of SFT and the sample inefficiency of sparse-reward RL.

🔄

Reverse-KL Distillation Objective

A bounded mode-seeking objective that filters out the teacher's epistemic uncertainty while maintaining action diversity, preventing both entropy explosion (Forward-KL) and premature entropy collapse (Hard-CE).

🛡

Catastrophic Forgetting Mitigation

Gradient updates remain grounded in the student's active policy manifold, achieving gentle alignment that preserves pre-trained generalist capabilities during task-specific post-training.

🔬

Extensive Empirical Validation

Demonstrates superior robustness and success rates compared to SFT, requiring substantially fewer training steps than on-policy RL baselines across LIBERO and RoboTwin2.0 benchmarks.

📋 Training Paradigm Comparison

VLA-OPD uniquely combines the advantages of both SFT and RL while avoiding their respective drawbacks:

| Paradigm | Sampling | Signal | Few-Demo | Convergence | Anti-Forgetting | Robustness |
|---|---|---|---|---|---|---|
| Offline SFT | Off-policy | Dense | × | Fast | × | × |
| Online RL | On-policy | Sparse | ✓ | Slow | ✓ | ✓ |
| VLA-OPD (Ours) | On-policy | Dense | ✓ | Fast | ✓ | ✓ |

VLA-OPD achieves the best of both worlds: on-policy sampling with dense supervision for fast convergence and robust learning.

Method

Figure 1. Overview of VLA-OPD. The framework consists of three phases: (1) on-policy trajectory sampling by the student, (2) dense token-level labeling by the expert teacher, and (3) mode-seeking optimization via Reverse-KL policy gradient.

VLA-OPD operates through a three-phase iterative training process that bridges supervised learning and reinforcement learning:

1

On-Policy Sampling

The student VLA policy interacts with the environment to collect trajectory rollouts, explicitly exposing distribution shift at out-of-distribution states.

2

Dense Teacher Labeling

A frozen expert teacher provides dense, token-level action labels for each state visited by the student — no environmental execution required.

3

Mode-Seeking Optimization

The student is updated via an on-policy policy gradient using a token-level Reverse-KL reward with group-based variance reduction; a minimal sketch of the full loop follows below.
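Taken together, the three phases form one iterative loop. The sketch below illustrates a single iteration in Python, assuming hypothetical `student`, `teacher`, and `env` interfaces (`rollout`, `log_prob`) and a standard PyTorch optimizer; it conveys the structure of the loop, not the authors' released code. The loss it calls, `vla_opd_loss`, is sketched after the objective below.

```python
import torch

def vla_opd_iteration(student, teacher, env, task, optimizer, group_size=8):
    """One VLA-OPD iteration (hypothetical interfaces, for illustration only)."""
    # Phase 1: on-policy sampling. The student itself generates the rollouts,
    # so every visited state lies on its current behavioral manifold.
    rollouts = [env.rollout(policy=student, task=task) for _ in range(group_size)]

    # Phase 2: dense teacher labeling. The frozen teacher scores every action
    # token the student emitted; no extra environment execution is required.
    student_lp = torch.stack([student.log_prob(r.observations, r.action_tokens)
                              for r in rollouts])          # [group_size, seq_len]
    with torch.no_grad():
        teacher_lp = torch.stack([teacher.log_prob(r.observations, r.action_tokens)
                                  for r in rollouts])      # [group_size, seq_len]

    # Phase 3: mode-seeking optimization via the token-level Reverse-KL
    # policy gradient with a group baseline (see vla_opd_loss below).
    loss = vla_opd_loss(student_lp, teacher_lp)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```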

The optimization objective is defined as:

$$
\mathcal{J}(\theta) \;=\; \mathbb{E}_{\,s \sim d_{\pi_\theta},\; a \sim \pi_\theta(\cdot \mid s)}\big[\, r(s, a) \,\big],
\qquad
r(s, a) \;=\; -\log \frac{\pi_\theta(a \mid s)}{\pi_T(a \mid s)},
$$

where the expectation over actions sampled from the student policy implicitly provides the probability weighting of the Reverse-KL divergence, and the per-sample reward is the negative log-ratio computed entirely from model log-probabilities.
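To make the objective concrete, here is a minimal PyTorch sketch of the token-level Reverse-KL reward and a group-baselined, REINFORCE-style loss. The tensor shapes, the mean-only group baseline, and the name `vla_opd_loss` are illustrative assumptions rather than the paper's exact formulation.

```python
import torch

def vla_opd_loss(student_logprobs: torch.Tensor,
                 teacher_logprobs: torch.Tensor) -> torch.Tensor:
    """Sketch of the Reverse-KL policy-gradient loss (illustrative, not official).

    student_logprobs: [group_size, seq_len] log-probs of the student's own
        sampled action tokens under the student (carries gradients).
    teacher_logprobs: [group_size, seq_len] log-probs of the same tokens
        under the frozen teacher.
    """
    # Per-token reward r = -(log pi_theta - log pi_T): the negative log-ratio.
    # Its expectation under the student's own sampling is the negative Reverse-KL.
    rewards = -(student_logprobs.detach() - teacher_logprobs.detach())

    # Group-based variance reduction: subtract the group-mean return from each
    # trajectory's return (a mean-only baseline is assumed here).
    returns = rewards.sum(dim=-1, keepdim=True)                 # [group_size, 1]
    advantages = returns - returns.mean(dim=0, keepdim=True)

    # REINFORCE-style objective: raise the likelihood of tokens on trajectories
    # that the teacher rates above the group average.
    return -(advantages * student_logprobs).mean()
```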

Why Reverse-KL?

The choice of divergence direction critically affects training stability. VLA-OPD's Reverse-KL objective achieves stable, bounded entropy — avoiding the failure modes of alternative objectives:

Forward-KL

Entropy Explosion

Mode-covering behavior forces the student to spread over the entire teacher distribution, inducing catastrophic entropy explosion at OOD states.

Hard-CE (Argmax)

Entropy Collapse

Rigid top-1 matching causes premature entropy collapse, eliminating action diversity needed for exploration and robust behavior.

Reverse-KL (Ours)

Stable & Bounded

The zero-forcing property captures the teacher's primary intent while maintaining sufficient stochasticity, avoiding both extremes; the three objectives are written out explicitly below.
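These are the standard divergence definitions, with π_θ denoting the student and π_T the teacher, as in the objective above:

```latex
% Forward KL (mode-covering): expectation under the teacher, so the student is
% penalized wherever the teacher places probability mass, even on modes the
% student cannot reach from its current state distribution.
D_{\mathrm{FKL}} = \mathrm{KL}\!\left(\pi_T \,\|\, \pi_\theta\right)
  = \mathbb{E}_{a \sim \pi_T(\cdot \mid s)}\!\left[\log \frac{\pi_T(a \mid s)}{\pi_\theta(a \mid s)}\right]

% Reverse KL (mode-seeking, zero-forcing): expectation under the student, so
% only actions the student actually samples contribute to the penalty.
D_{\mathrm{RKL}} = \mathrm{KL}\!\left(\pi_\theta \,\|\, \pi_T\right)
  = \mathbb{E}_{a \sim \pi_\theta(\cdot \mid s)}\!\left[\log \frac{\pi_\theta(a \mid s)}{\pi_T(a \mid s)}\right]

% Hard-CE: cross-entropy against only the teacher's argmax token.
\mathcal{L}_{\mathrm{CE}} = -\log \pi_\theta\!\left(\operatorname*{arg\,max}_{a'} \pi_T(a' \mid s) \,\middle|\, s\right)
```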

KL Ablation - Success Rate
Success rate comparison across KL objectives
KL Ablation - Entropy
Actor entropy dynamics across KL objectives

📈 Experimental Results

LIBERO Benchmark (Single-Arm Manipulation)

Success rates (%).

| Method | # Demos | Spatial | Object | Goal | Long | Avg. |
|---|---|---|---|---|---|---|
| Teacher (SimpleVLA-RL) | – | 94.2 | 96.1 | 94.6 | 90.7 | 93.9 |
| Octo | 50 | 78.9 | 85.7 | 84.6 | 51.1 | 75.1 |
| OpenVLA | 50 | 84.7 | 88.4 | 79.2 | 53.7 | 76.5 |
| Nora | 50 | 92.2 | 95.4 | 89.4 | 74.6 | 87.9 |
| π0 + FAST | 50 | 96.4 | 96.8 | 88.6 | 60.2 | 85.5 |
| OpenVLA-OFT (Student Init.) | 1 | 63.6 | 54.9 | 59.6 | 17.3 | 48.9 |
| VLA-OPD (Distill) | 1 | 84.3 | 93.8 | 92.5 | 78.9 | 87.4 |
| VLA-OPD (Distill + GRPO) | 1 | 93.4 | 95.3 | 94.5 | 90.2 | 93.4 |

With only one demonstration per task, VLA-OPD matches or surpasses full-dataset (50-demo) baselines, improving average success by 38.5 percentage points over the SFT initialization.

Training Efficiency

Training Efficiency - LIBERO Object
LIBERO-Object: Rapid convergence within 10 steps
Training Efficiency - LIBERO Long
LIBERO-Long: 3x faster convergence than GRPO

RoboTwin2.0 Benchmark (Dual-Arm Manipulation)

Success rates (%).

| Method | Pick Dual Bottles | Place Empty Cup | Handover Block | Stack Bowls | Avg. |
|---|---|---|---|---|---|
| Teacher (SimpleVLA-RL) | 68.3 | 94.2 | 57.8 | 75.8 | 74.0 |
| π0 | 50.0 | 60.0 | 39.0 | 53.0 | 50.5 |
| RDT | 18.0 | 42.0 | 26.0 | 42.0 | 32.0 |
| OpenVLA-OFT (Student Init.) | 29.7 | 77.3 | 33.1 | 40.6 | 45.2 |
| VLA-OPD (Distill) | 66.4 | 90.6 | 52.3 | 75.0 | 71.1 |

VLA-OPD generalizes to complex dual-arm manipulation, improving average success by 25.9 percentage points over its initialization and nearly matching the expert teacher (71.1% vs. 74.0%).

🧠 Catastrophic Forgetting Analysis

Offline SFT exhibits severe catastrophic forgetting: as seen-task success improves, unseen-task performance collapses. VLA-OPD's on-policy approach anchors gradient updates to the student's current behavioral manifold, preserving generalist capabilities.

Forgetting - Object 1
Object Unseen Task (Setting 1)
Forgetting - Object 2
Object Unseen Task (Setting 2)
Forgetting - Spatial 1
Spatial Unseen Task (Setting 1)
Forgetting - Spatial 2
Spatial Unseen Task (Setting 2)

Ablation Studies

We investigate the impact of group sampling size on training performance. Larger groups (G=8) provide the smoothest optimization, while even small groups (G=2) maintain competitive performance with reduced computational overhead.

Group Size - Success Rate
Group sampling size vs. success rate
Group Size - KL
Group sampling size vs. KL divergence

📚 Citation

If you find this work useful, please consider citing our paper:

@article{zhong2025vlaopd,
  title     = {VLA-OPD: Bridging Offline SFT and Online RL for
               Vision-Language-Action Models via On-Policy Distillation},
  author    = {Zhong, Zhide and Yan, Haodong and Li, Junfeng and
               He, Junjie and Zhang, Tianran and Li, Haoang},
  year      = {2025},
  institution = {HKUST (Guangzhou)}
}