A unified post-training framework that combines the sample efficiency of supervised fine-tuning with the robustness of reinforcement learning, via on-policy distillation under a Reverse-KL objective.
The Hong Kong University of Science and Technology (Guangzhou)
Although pre-trained Vision-Language-Action (VLA) models exhibit impressive generalization in robotic manipulation, post-training remains crucial to ensure reliable performance during deployment. However, standard offline Supervised Fine-Tuning (SFT) suffers from distribution shifts and catastrophic forgetting of pre-trained capabilities, while online Reinforcement Learning (RL) struggles with sparse rewards and poor sample efficiency. In this paper, we propose On-Policy VLA Distillation (VLA-OPD), a framework bridging the efficiency of SFT with the robustness of RL. Instead of relying on sparse environmental rewards, VLA-OPD leverages an expert teacher to provide dense, token-level supervision on the student's self-generated trajectories. This enables active error correction on policy-induced states while preserving pre-trained general capabilities through gentle alignment. Crucially, we formulate VLA-OPD via a Reverse-KL objective. Unlike standard Forward-KL that induces mode-covering entropy explosion, or Hard-CE that causes premature entropy collapse, our bounded mode-seeking objective ensures stable policy learning by filtering out the teacher's epistemic uncertainty while maintaining action diversity. Experiments on LIBERO and RoboTwin2.0 benchmarks demonstrate that VLA-OPD significantly improves sample efficiency over RL and robustness over SFT, while effectively mitigating catastrophic forgetting during post-training.
- VLA-OPD bridges SFT and RL by leveraging dense, token-level supervision on self-generated trajectories, resolving the exposure bias of SFT and the sample inefficiency of sparse-reward RL.
- A bounded, mode-seeking Reverse-KL objective filters out the teacher's epistemic uncertainty while maintaining action diversity, preventing both the entropy explosion of Forward-KL and the premature entropy collapse of Hard-CE.
- Gradient updates remain grounded in the student's active policy manifold, achieving gentle alignment that preserves pre-trained generalist capabilities during task-specific post-training.
- VLA-OPD achieves superior robustness and success rates compared to SFT while requiring substantially fewer training steps than on-policy RL baselines across the LIBERO and RoboTwin2.0 benchmarks.
VLA-OPD uniquely combines the advantages of both SFT and RL while avoiding their respective drawbacks:
| Paradigm | Sampling | Signal | Few-Demo | Convergence | Anti-Forgetting | Robustness |
|---|---|---|---|---|---|---|
| Offline SFT | Off-policy | Dense | × | Fast | × | × |
| Online RL | On-policy | Sparse | ✓ | Slow | ✓ | ✓ |
| VLA-OPD (Ours) | On-policy | Dense | ✓ | Fast | ✓ | ✓ |
VLA-OPD achieves the best of both worlds: on-policy sampling with dense supervision for fast convergence and robust learning.
VLA-OPD operates through a three-phase iterative training process that bridges supervised learning and reinforcement learning:
1. **Rollout.** The student VLA policy interacts with the environment to collect trajectory rollouts, explicitly exposing distribution shift at out-of-distribution states.
2. **Teacher labeling.** A frozen expert teacher provides dense, token-level action labels for every state the student visits, with no environmental execution required.
3. **Policy update.** The student is updated via an on-policy policy gradient using a token-level Reverse-KL reward with group-based variance reduction (a tensor-level sketch follows below).
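A minimal tensor-level sketch of the third phase is given below. The helper name, the `[G, T]` tensor layout, and the REINFORCE-style surrogate with a per-rollout, group-normalized advantage are illustrative assumptions, not the authors' released implementation.

```python
import torch

def opd_policy_gradient_loss(student_logp, teacher_logp, eps=1e-6):
    """Phase-3 update for one group of rollouts (illustrative sketch).

    student_logp, teacher_logp: [G, T] log-probabilities of the action tokens the
    student actually sampled, scored by the student and by the frozen teacher.
    """
    with torch.no_grad():
        # Token-level Reverse-KL reward: r_t = -(log pi_theta(a_t) - log pi_T(a_t)).
        rewards = teacher_logp - student_logp            # [G, T]
        returns = rewards.sum(dim=-1)                    # per-rollout return, [G]

        # Group-based variance reduction: normalize returns within the group.
        advantages = (returns - returns.mean()) / (returns.std() + eps)  # [G]

    # REINFORCE-style surrogate: raise log-probs of tokens from high-advantage rollouts.
    loss = -(advantages.unsqueeze(-1) * student_logp).mean()
    return loss

# Shape-only usage with dummy log-probs (G=8 rollouts, 64 action tokens each).
student_logp = -torch.rand(8, 64, requires_grad=True)
teacher_logp = -torch.rand(8, 64)
opd_policy_gradient_loss(student_logp, teacher_logp).backward()
```

In this sketch the group-normalized advantage is shared across all tokens of a rollout; other credit-assignment choices (e.g., per-token advantages without summation) are equally compatible with the Reverse-KL reward.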
The optimization objective is defined as

$$
\mathcal{J}(\theta) \;=\; \mathbb{E}_{a \sim \pi_\theta(\cdot \mid s)}\big[\, r(s, a) \,\big],
\qquad
r(s, a) \;=\; -\log \frac{\pi_\theta(a \mid s)}{\pi_T(a \mid s)},
$$

where $\pi_\theta$ denotes the student policy and $\pi_T$ the frozen teacher. The expectation over actions sampled from the student policy implicitly provides the probability weighting of the Reverse-KL divergence, so maximizing $\mathcal{J}(\theta)$ is equivalent to minimizing $D_{\mathrm{KL}}\!\left(\pi_\theta \,\|\, \pi_T\right)$, and the per-sample reward is the negative log-ratio computed entirely from model log-probabilities.
The choice of divergence direction critically affects training stability. VLA-OPD's Reverse-KL objective achieves stable, bounded entropy — avoiding the failure modes of alternative objectives:
- **Forward-KL:** Mode-covering behavior forces the student to spread over the entire teacher distribution, inducing catastrophic entropy explosion at OOD states.
- **Hard-CE:** Rigid top-1 matching causes premature entropy collapse, eliminating the action diversity needed for exploration and robust behavior.
- **Reverse-KL (VLA-OPD):** The zero-forcing property captures the teacher's primary intent while maintaining sufficient stochasticity, elegantly avoiding both extremes.
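To make the directional difference concrete, the snippet below compares the three per-token objectives in closed form on raw logits. It is illustrative only: VLA-OPD estimates the Reverse-KL from tokens sampled by the student (the reward above) rather than from full distributions, and the function name is hypothetical.

```python
import torch
import torch.nn.functional as F

def per_token_objectives(student_logits, teacher_logits):
    """Closed-form comparison of the three objectives on [batch, vocab] logits."""
    log_p_s = F.log_softmax(student_logits, dim=-1)
    log_p_t = F.log_softmax(teacher_logits, dim=-1)
    p_s, p_t = log_p_s.exp(), log_p_t.exp()

    # Forward-KL  KL(teacher || student): teacher-weighted, mode-covering.
    forward_kl = (p_t * (log_p_t - log_p_s)).sum(-1).mean()

    # Hard-CE: cross-entropy against the teacher's argmax token, rigid top-1 matching.
    hard_ce = F.cross_entropy(student_logits, teacher_logits.argmax(dim=-1))

    # Reverse-KL  KL(student || teacher): student-weighted, mode-seeking / zero-forcing.
    reverse_kl = (p_s * (log_p_s - log_p_t)).sum(-1).mean()

    return forward_kl, hard_ce, reverse_kl
```

The zero-forcing behavior comes from the `p_s` weighting: wherever the student already places negligible mass, the teacher's residual uncertainty contributes almost nothing to the gradient.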
| Method | # Demos | Spatial | Object | Goal | Long | Avg. |
|---|---|---|---|---|---|---|
| Teacher (SimpleVLA-RL) | — | 94.2 | 96.1 | 94.6 | 90.7 | 93.9 |
| Octo | 50 | 78.9 | 85.7 | 84.6 | 51.1 | 75.1 |
| OpenVLA | 50 | 84.7 | 88.4 | 79.2 | 53.7 | 76.5 |
| Nora | 50 | 92.2 | 95.4 | 89.4 | 74.6 | 87.9 |
| π0 + FAST | 50 | 96.4 | 96.8 | 88.6 | 60.2 | 85.5 |
| OpenVLA-OFT (Student Init.) | 1 | 63.6 | 54.9 | 59.6 | 17.3 | 48.9 |
| VLA-OPD (Distill) | 1 | 84.3 | 93.8 | 92.5 | 78.9 | 87.4 |
| VLA-OPD (Distill + GRPO) | 1 | 93.4 | 95.3 | 94.5 | 90.2 | 93.4 |
With only 1 demonstration per task, VLA-OPD matches or surpasses full-dataset (50-demo) baselines, achieving a +38.5% improvement over the SFT initialization.
| Method | Pick Dual Bottles | Place Empty Cup | Handover Block | Stack Bowls | Avg. |
|---|---|---|---|---|---|
| Teacher (SimpleVLA-RL) | 68.3 | 94.2 | 57.8 | 75.8 | 74.0 |
| π0 | 50.0 | 60.0 | 39.0 | 53.0 | 50.5 |
| RDT | 18.0 | 42.0 | 26.0 | 42.0 | 32.0 |
| OpenVLA-OFT (Student Init.) | 29.7 | 77.3 | 33.1 | 40.6 | 45.2 |
| VLA-OPD (Distill) | 66.4 | 90.6 | 52.3 | 75.0 | 71.1 |
VLA-OPD generalizes to complex dual-arm manipulation, improving by +25.9% over initialization and nearly matching the expert teacher (71.1% vs 74.0%).
Offline SFT exhibits severe catastrophic forgetting: as seen-task success improves, unseen-task performance collapses. VLA-OPD's on-policy approach anchors gradient updates to the student's current behavioral manifold, preserving generalist capabilities.




We investigate the impact of group sampling size on training performance. Larger groups (G=8) provide the smoothest optimization, while even small groups (G=2) maintain competitive performance with reduced computational overhead.
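As a back-of-the-envelope illustration of why larger groups smooth optimization, the toy simulation below (synthetic rewards, not the paper's data) shows that the group-mean baseline used for variance reduction becomes less noisy roughly as $1/\sqrt{G}$:

```python
import numpy as np

rng = np.random.default_rng(0)
sigma, n_trials = 1.0, 100_000  # toy reward spread and number of simulated groups

for g in (2, 4, 8):
    rewards = rng.normal(0.0, sigma, size=(n_trials, g))  # synthetic per-rollout returns
    baseline = rewards.mean(axis=1)                        # group-mean baseline per group
    print(f"G={g}: baseline std ~ {baseline.std():.3f} (theory: {sigma / np.sqrt(g):.3f})")
```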


If you find this work useful, please consider citing our paper:
@article{zhong2025vlaopd,
  title       = {VLA-OPD: Bridging Offline SFT and Online RL for Vision-Language-Action Models via On-Policy Distillation},
  author      = {Zhong, Zhide and Yan, Haodong and Li, Junfeng and He, Junjie and Zhang, Tianran and Li, Haoang},
  year        = {2025},
  institution = {HKUST (Guangzhou)}
}