by Eric Tang and the NovaSky Team
Posted: November 7, 2025
<aside>
In this report, we show that it's quick and simple to implement and run on-policy distillation in SkyRL!
For more details, check out the following:
- Complete reproduction scripts
- WandB report for our training runs
Try it out at SkyRL on GitHub!

</aside>
On-Policy Distillation is a technique recently highlighted by Thinking Machines, showcasing prior research from Agarwal et al., Gu et al., and the Qwen3 team, which combines the benefits of on-policy RL-style training with the dense reward signal of distillation. The main idea is to collect on-policy samples from a student model, use a teacher to grade each token of the student's samples, update the student policy accordingly, and repeat.
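Sketched in plain PyTorch, one step of this loop looks roughly like the following. The `student` and `teacher` objects and their `generate`/`logprobs` methods are hypothetical placeholders for illustration only (not SkyRL APIs); SkyRL's actual implementation reuses its PPO-style trainer machinery, as described in the rest of this post.

```python
import torch


def on_policy_distillation_step(student, teacher, prompts, optimizer):
    """One sketched step of on-policy distillation.

    `student`, `teacher`, `generate`, and `logprobs` are placeholders,
    not SkyRL APIs.
    """
    # 1. Collect on-policy samples from the student.
    with torch.no_grad():
        responses = student.generate(prompts)

    # 2. Grade every sampled token with the teacher: the per-token "reward"
    #    is the negative reverse KL sample, log pi_teacher - log pi_student.
    student_logprobs = student.logprobs(prompts, responses)  # differentiable
    with torch.no_grad():
        teacher_logprobs = teacher.logprobs(prompts, responses)
        per_token_advantage = teacher_logprobs - student_logprobs.detach()

    # 3. Policy-gradient style update: reinforce tokens the teacher likes more
    #    than the student, push down tokens it likes less. Then repeat with
    #    fresh on-policy samples.
    loss = -(student_logprobs * per_token_advantage).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```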
Supporting On-Policy Distillation in SkyRL is simple: we only need to replace the reference model with a teacher model and modify the reward and loss calculations in the trainer.
The short PR adding an on-policy distillation example to SkyRL can be found at https://github.com/NovaSky-AI/SkyRL/pull/585. We walk through the example script from the PR (which required no modifications to the core SkyRL library) below.
First, we set the reference model path to point to the teacher model. With Qwen3-32B as the teacher and Qwen3-4B-Base as the student, we set the policy and reference model paths as follows:

```bash
trainer.policy.model.path="Qwen/Qwen3-4B-Base"
trainer.ref.model.path="Qwen/Qwen3-32B"
```
Next, we need to set the per-token reward for our RL trainer to be the negative reverse KL between the student and the teacher.

In SkyRL, we can do this by overriding the apply_reward_kl_penalty method in the RayPPOTrainer class (code link), which is the base trainer class for all PPO-related algorithms in SkyRL (GRPO, DAPO, etc.).
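Concretely, writing $\pi_\theta$ for the student policy and $\pi_T$ for the teacher, the per-token reward on the student's sampled tokens is

$$
r_t = -\big(\log \pi_\theta(a_t \mid s_t) - \log \pi_T(a_t \mid s_t)\big),
$$

which is positive exactly where the teacher assigns higher log-probability to the sampled token than the student does, giving a dense token-level signal rather than a single sequence-level reward.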
Figure: Overview of the Generator and Trainer dataflow in SkyRL. Modifications needed for On-Policy Distillation are highlighted in red.
We show the complete code snippet below:
```python
import torch

from skyrl_train.trainer import RayPPOTrainer
from skyrl_train.training_batch import TrainingInputBatch


class OnPolicyDistillationTrainer(RayPPOTrainer):
    """
    Custom trainer for On-Policy Distillation.

    Overrides the apply_reward_kl_penalty method to set rewards to the KL penalty.
    """

    def apply_reward_kl_penalty(
        self,
        data: TrainingInputBatch,
    ) -> TrainingInputBatch:
        """Computes the KL penalty and sets the rewards to the KL penalty."""
        loss_masks_all: torch.Tensor = data["loss_mask"]
        teacher_action_log_probs: torch.Tensor = data["base_action_log_probs"]
        action_log_probs: torch.Tensor = data["action_log_probs"]

        # Set rewards to the negative per-token reverse KL (the KL penalty).
        rewards = -(action_log_probs - teacher_action_log_probs) * loss_masks_all
        data["rewards"] = rewards
        return data
```
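As a quick sanity check of the tensor arithmetic, here is what the reward computation does on made-up values (toy shapes and numbers; this only mirrors the math above, not SkyRL's actual TrainingInputBatch contents):

```python
import torch

# Toy values, not real model outputs.
action_log_probs = torch.tensor([[-0.5, -2.0, -1.0]])          # student log-probs per token
teacher_action_log_probs = torch.tensor([[-0.6, -0.4, -1.0]])  # teacher log-probs per token
loss_mask = torch.tensor([[1.0, 1.0, 0.0]])                     # third position is masked out

rewards = -(action_log_probs - teacher_action_log_probs) * loss_mask
print(rewards)
# ~ [[-0.1, 1.6, 0.0]]: tokens the teacher likes more than the student get a
# positive reward, tokens it likes less get a negative one, and masked
# positions contribute nothing.
```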
Next, we modify the advantage estimator to be a no-op, simply setting the advantages and returns to the negative reverse KL that we computed in apply_reward_kl_penalty above. This is made easy by SkyRL's simple registry system for creating custom advantage estimators and policy losses. We just need to add the following code to our training script: