Robotics · VLA · Post-Training

VLA-OPD: Bridging Offline SFT and Online RL for Vision-Language-Action Models via On-Policy Distillation

A unified post-training framework that combines the sample efficiency of supervised fine-tuning with the robustness of reinforcement learning through on-policy distillation with Reverse-KL optimization.

Zhide Zhong1 Haodong Yan1 Junfeng Li1 Junjie He1 Tianran Zhang1 Haoang Li1

1 The Hong Kong University of Science and Technology (Guangzhou)

📜 Abstract

Although pre-trained Vision-Language-Action (VLA) models exhibit impressive generalization in robotic manipulation, post-training remains crucial to ensure reliable performance during deployment. However, standard offline Supervised Fine-Tuning (SFT) suffers from distribution shifts and catastrophic forgetting of pre-trained capabilities, while online Reinforcement Learning (RL) struggles with sparse rewards and poor sample efficiency. In this paper, we propose On-Policy VLA Distillation (VLA-OPD), a framework bridging the efficiency of SFT with the robustness of RL. Instead of relying on sparse environmental rewards, VLA-OPD leverages an expert teacher to provide dense, token-level supervision on the student's self-generated trajectories. This enables active error correction on policy-induced states while preserving pre-trained general capabilities through gentle alignment. Crucially, we formulate VLA-OPD via a Reverse-KL objective. Unlike standard Forward-KL that induces mode-covering entropy explosion, or Hard-CE that causes premature entropy collapse, our bounded mode-seeking objective ensures stable policy learning by filtering out the teacher's epistemic uncertainty while maintaining action diversity. Experiments on LIBERO and RoboTwin2.0 benchmarks demonstrate that VLA-OPD significantly improves sample efficiency over RL and robustness over SFT, while effectively mitigating catastrophic forgetting during post-training.

Vision-Language-Action Models Post-Training On-Policy Distillation Reverse-KL Divergence Robotic Manipulation

Key Contributions

🔗

Unified Post-Training Framework

VLA-OPD bridges SFT and RL by leveraging dense, token-level supervision on self-generated trajectories, resolving the exposure bias of SFT and the sample inefficiency of sparse-reward RL.

🔄

Reverse-KL Distillation Objective

A bounded mode-seeking objective that filters out the teacher's epistemic uncertainty while maintaining action diversity, preventing both entropy explosion (Forward-KL) and premature entropy collapse (Hard-CE).

🛡

Catastrophic Forgetting Mitigation

Gradient updates remain grounded in the student's active policy manifold, achieving gentle alignment that preserves pre-trained generalist capabilities during task-specific post-training.

🔬

Extensive Empirical Validation

Demonstrates superior robustness and success rates compared to SFT, requiring substantially fewer training steps than on-policy RL baselines across LIBERO and RoboTwin2.0 benchmarks.

📋 Training Paradigm Comparison

VLA-OPD uniquely combines the advantages of both SFT and RL while avoiding their respective drawbacks:

| Paradigm | Sampling | Signal | Few-Demo | Convergence | Anti-Forgetting | Robustness |
|---|---|---|---|---|---|---|
| Offline SFT | Off-policy | Dense | × | Fast | × | × |
| Online RL | On-policy | Sparse | ✓ | Slow | ✓ | ✓ |
| VLA-OPD (Ours) | On-policy | Dense | ✓ | Fast | ✓ | ✓ |

VLA-OPD achieves the best of both worlds: on-policy sampling with dense supervision for fast convergence and robust learning.

Method

Figure 1. Overview of VLA-OPD. The framework consists of three phases: (1) on-policy trajectory sampling by the student, (2) dense token-level labeling by the expert teacher, and (3) mode-seeking optimization via Reverse-KL policy gradient.

VLA-OPD operates through a three-phase iterative training process that bridges supervised learning and reinforcement learning:

1

On-Policy Sampling

The student VLA policy interacts with the environment to collect trajectory rollouts, explicitly exposing distribution shift at out-of-distribution states.

2

Dense Teacher Labeling

A frozen expert teacher provides dense, token-level action labels for each state visited by the student — no environmental execution required.

3

Mode-Seeking Optimization

The student is updated via an on-policy policy gradient using a token-level Reverse-KL reward with group-based variance reduction; a minimal sketch of the full loop follows below.
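Taken together, the three phases form one iterative loop. The sketch below illustrates a single iteration in Python, assuming hypothetical `student`, `teacher`, and `env` interfaces (`rollout`, `log_prob`) and a standard PyTorch optimizer; it conveys the structure of the loop, not the authors' released code. The loss it calls, `vla_opd_loss`, is sketched after the objective below.

```python
import torch

def vla_opd_iteration(student, teacher, env, task, optimizer, group_size=8):
    """One VLA-OPD iteration (hypothetical interfaces, for illustration only)."""
    # Phase 1: on-policy sampling. The student itself generates the rollouts,
    # so every visited state lies on its current behavioral manifold.
    rollouts = [env.rollout(policy=student, task=task) for _ in range(group_size)]

    # Phase 2: dense teacher labeling. The frozen teacher scores every action
    # token the student emitted; no extra environment execution is required.
    student_lp = torch.stack([student.log_prob(r.observations, r.action_tokens)
                              for r in rollouts])          # [group_size, seq_len]
    with torch.no_grad():
        teacher_lp = torch.stack([teacher.log_prob(r.observations, r.action_tokens)
                                  for r in rollouts])      # [group_size, seq_len]

    # Phase 3: mode-seeking optimization via the token-level Reverse-KL
    # policy gradient with a group baseline (see vla_opd_loss below).
    loss = vla_opd_loss(student_lp, teacher_lp)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```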

The optimization objective is defined as:

$$
\mathcal{J}(\theta) \;=\; \mathbb{E}_{\,s \sim d_{\pi_\theta},\; a \sim \pi_\theta(\cdot \mid s)}\big[\, r(s, a) \,\big],
\qquad
r(s, a) \;=\; -\log \frac{\pi_\theta(a \mid s)}{\pi_T(a \mid s)},
$$

where the expectation over actions sampled from the student policy implicitly provides the probability weighting of the Reverse-KL divergence, and the per-sample reward is the negative log-ratio computed entirely from model log-probabilities.
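To make the objective concrete, here is a minimal PyTorch sketch of the token-level Reverse-KL reward and a group-baselined, REINFORCE-style loss. The tensor shapes, the mean-only group baseline, and the name `vla_opd_loss` are illustrative assumptions rather than the paper's exact formulation.

```python
import torch

def vla_opd_loss(student_logprobs: torch.Tensor,
                 teacher_logprobs: torch.Tensor) -> torch.Tensor:
    """Sketch of the Reverse-KL policy-gradient loss (illustrative, not official).

    student_logprobs: [group_size, seq_len] log-probs of the student's own
        sampled action tokens under the student (carries gradients).
    teacher_logprobs: [group_size, seq_len] log-probs of the same tokens
        under the frozen teacher.
    """
    # Per-token reward r = -(log pi_theta - log pi_T): the negative log-ratio.
    # Its expectation under the student's own sampling is the negative Reverse-KL.
    rewards = -(student_logprobs.detach() - teacher_logprobs.detach())

    # Group-based variance reduction: subtract the group-mean return from each
    # trajectory's return (a mean-only baseline is assumed here).
    returns = rewards.sum(dim=-1, keepdim=True)                 # [group_size, 1]
    advantages = returns - returns.mean(dim=0, keepdim=True)

    # REINFORCE-style objective: raise the likelihood of tokens on trajectories
    # that the teacher rates above the group average.
    return -(advantages * student_logprobs).mean()
```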

Why Reverse-KL?

The choice of divergence direction critically affects training stability. VLA-OPD's Reverse-KL objective achieves stable, bounded entropy — avoiding the failure modes of alternative objectives:

Forward-KL

Entropy Explosion

Mode-covering behavior forces the student to spread over the entire teacher distribution, inducing catastrophic entropy explosion at OOD states.

Hard-CE (Argmax)

Entropy Collapse

Rigid top-1 matching causes premature entropy collapse, eliminating action diversity needed for exploration and robust behavior.

Reverse-KL (Ours)

Stable & Bounded

The zero-forcing property captures the teacher's primary intent while maintaining sufficient stochasticity, avoiding both extremes; the three objectives are written out explicitly below.
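These are the standard divergence definitions, with π_θ denoting the student and π_T the teacher, as in the objective above:

```latex
% Forward KL (mode-covering): expectation under the teacher, so the student is
% penalized wherever the teacher places probability mass, even on modes the
% student cannot reach from its current state distribution.
D_{\mathrm{FKL}} = \mathrm{KL}\!\left(\pi_T \,\|\, \pi_\theta\right)
  = \mathbb{E}_{a \sim \pi_T(\cdot \mid s)}\!\left[\log \frac{\pi_T(a \mid s)}{\pi_\theta(a \mid s)}\right]

% Reverse KL (mode-seeking, zero-forcing): expectation under the student, so
% only actions the student actually samples contribute to the penalty.
D_{\mathrm{RKL}} = \mathrm{KL}\!\left(\pi_\theta \,\|\, \pi_T\right)
  = \mathbb{E}_{a \sim \pi_\theta(\cdot \mid s)}\!\left[\log \frac{\pi_\theta(a \mid s)}{\pi_T(a \mid s)}\right]

% Hard-CE: cross-entropy against only the teacher's argmax token.
\mathcal{L}_{\mathrm{CE}} = -\log \pi_\theta\!\left(\operatorname*{arg\,max}_{a'} \pi_T(a' \mid s) \,\middle|\, s\right)
```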

KL Ablation - Success Rate
Success rate comparison across KL objectives
KL Ablation - Entropy
Actor entropy dynamics across KL objectives

📈 Experimental Results

LIBERO Benchmark (Single-Arm Manipulation)

Success rates (%).

| Method | # Demos | Spatial | Object | Goal | Long | Avg. |
|---|---|---|---|---|---|---|
| Teacher (SimpleVLA-RL) | – | 94.2 | 96.1 | 94.6 | 90.7 | 93.9 |
| Octo | 50 | 78.9 | 85.7 | 84.6 | 51.1 | 75.1 |
| OpenVLA | 50 | 84.7 | 88.4 | 79.2 | 53.7 | 76.5 |
| Nora | 50 | 92.2 | 95.4 | 89.4 | 74.6 | 87.9 |
| π0 + FAST | 50 | 96.4 | 96.8 | 88.6 | 60.2 | 85.5 |
| OpenVLA-OFT (Student Init.) | 1 | 63.6 | 54.9 | 59.6 | 17.3 | 48.9 |
| VLA-OPD (Distill) | 1 | 84.3 | 93.8 | 92.5 | 78.9 | 87.4 |
| VLA-OPD (Distill + GRPO) | 1 | 93.4 | 95.3 | 94.5 | 90.2 | 93.4 |

With only one demonstration per task, VLA-OPD matches or surpasses full-dataset (50-demo) baselines, improving average success by 38.5 percentage points over the SFT initialization.

Training Efficiency

Training Efficiency - LIBERO Object
LIBERO-Object: Rapid convergence within 10 steps
Training Efficiency - LIBERO Long
LIBERO-Long: 3x faster convergence than GRPO

RoboTwin2.0 Benchmark (Dual-Arm Manipulation)

Success rates (%).

| Method | Pick Dual Bottles | Place Empty Cup | Handover Block | Stack Bowls | Avg. |
|---|---|---|---|---|---|
| Teacher (SimpleVLA-RL) | 68.3 | 94.2 | 57.8 | 75.8 | 74.0 |
| π0 | 50.0 | 60.0 | 39.0 | 53.0 | 50.5 |
| RDT | 18.0 | 42.0 | 26.0 | 42.0 | 32.0 |
| OpenVLA-OFT (Student Init.) | 29.7 | 77.3 | 33.1 | 40.6 | 45.2 |
| VLA-OPD (Distill) | 66.4 | 90.6 | 52.3 | 75.0 | 71.1 |

VLA-OPD generalizes to complex dual-arm manipulation, improving average success by 25.9 percentage points over its initialization and nearly matching the expert teacher (71.1% vs. 74.0%).

🧠 Catastrophic Forgetting Analysis

Offline SFT exhibits severe catastrophic forgetting: as seen-task success improves, unseen-task performance collapses. VLA-OPD's on-policy approach anchors gradient updates to the student's current behavioral manifold, preserving generalist capabilities.

Forgetting - Object 1
Object Unseen Task (Setting 1)
Forgetting - Object 2
Object Unseen Task (Setting 2)
Forgetting - Spatial 1
Spatial Unseen Task (Setting 1)
Forgetting - Spatial 2
Spatial Unseen Task (Setting 2)

Ablation Studies

We investigate the impact of group sampling size on training performance. Larger groups (G=8) provide the smoothest optimization, while even small groups (G=2) maintain competitive performance with reduced computational overhead.

Group Size - Success Rate
Group sampling size vs. success rate
Group Size - KL
Group sampling size vs. KL divergence

📚 Citation

If you find this work useful, please consider citing our paper:

@article{zhong2025vlaopd,
  title     = {VLA-OPD: Bridging Offline SFT and Online RL for
               Vision-Language-Action Models via On-Policy Distillation},
  author    = {Zhong, Zhide and Yan, Haodong and Li, Junfeng and
               He, Junjie and Zhang, Tianran and Li, Haoang},
  year      = {2025},
  institution = {HKUST (Guangzhou)}
}