by Eric Tang and the NovaSky Team
Posted: November 7, 2025
<aside>
In this report, we show that it's quick and simple to implement and run on-policy distillation in SkyRL!
For more details, check out the following:
- Complete reproduction scripts
- WandB report for our training runs
Try it out at SkyRL on GitHub!

</aside>
On-Policy Distillation is a technique recently highlighted by Thinking Machines, showcasing prior research from Agarwal et al., Gu et al., and the Qwen3 team, which combines the benefits of on-policy RL-style training with the dense reward signal of distillation. The main idea is to collect on-policy samples from a student model, use a teacher to grade each token of the student's samples, update the student policy accordingly, and repeat.
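Sketched in plain PyTorch, one step of this loop looks roughly like the following. The `student` and `teacher` objects and their `generate`/`logprobs` methods are hypothetical placeholders for illustration only (not SkyRL APIs); SkyRL's actual implementation reuses its PPO-style trainer machinery, as described in the rest of this post.

```python
import torch


def on_policy_distillation_step(student, teacher, prompts, optimizer):
    """One sketched step of on-policy distillation.

    `student`, `teacher`, `generate`, and `logprobs` are placeholders,
    not SkyRL APIs.
    """
    # 1. Collect on-policy samples from the student.
    with torch.no_grad():
        responses = student.generate(prompts)

    # 2. Grade every sampled token with the teacher: the per-token "reward"
    #    is the negative reverse KL sample, log pi_teacher - log pi_student.
    student_logprobs = student.logprobs(prompts, responses)  # differentiable
    with torch.no_grad():
        teacher_logprobs = teacher.logprobs(prompts, responses)
        per_token_advantage = teacher_logprobs - student_logprobs.detach()

    # 3. Policy-gradient style update: reinforce tokens the teacher likes more
    #    than the student, push down tokens it likes less. Then repeat with
    #    fresh on-policy samples.
    loss = -(student_logprobs * per_token_advantage).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```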
Supporting On-Policy Distillation in SkyRL is simple: we only need to replace the reference model with a teacher model and modify the reward and loss calculations in the trainer.
The short PR adding an on-policy distillation example to SkyRL can be found at https://github.com/NovaSky-AI/SkyRL/pull/585. We walk through the example script from the PR (which required no modifications to the core SkyRL library) below.
First, we set the reference model path to point to the teacher model. With Qwen3-32B as the teacher and Qwen3-4B-Base as the student, we set the policy and reference model paths as follows:

```bash
trainer.policy.model.path="Qwen/Qwen3-4B-Base"
trainer.ref.model.path="Qwen/Qwen3-32B"
```
Next, we need to set the per-token reward for our RL trainer to be the negative reverse KL between the student and the teacher.

In SkyRL, we can do this by overriding the apply_reward_kl_penalty method in the RayPPOTrainer class (code link), which is the base trainer class for all PPO-related algorithms in SkyRL (GRPO, DAPO, etc.).
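Concretely, writing $\pi_\theta$ for the student policy and $\pi_T$ for the teacher, the per-token reward on the student's sampled tokens is

$$
r_t = -\big(\log \pi_\theta(a_t \mid s_t) - \log \pi_T(a_t \mid s_t)\big),
$$

which is positive exactly where the teacher assigns higher log-probability to the sampled token than the student does, giving a dense token-level signal rather than a single sequence-level reward.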
Figure: Overview of the Generator and Trainer dataflow in SkyRL. Modifications needed for On-Policy Distillation are highlighted in red.
We show the complete code snippet below:
```python
import torch

from skyrl_train.trainer import RayPPOTrainer
from skyrl_train.training_batch import TrainingInputBatch


class OnPolicyDistillationTrainer(RayPPOTrainer):
    """
    Custom trainer for On-Policy Distillation.

    Overrides the apply_reward_kl_penalty method to set rewards to the KL penalty.
    """

    def apply_reward_kl_penalty(
        self,
        data: TrainingInputBatch,
    ) -> TrainingInputBatch:
        """Computes the KL penalty and sets the rewards to the KL penalty."""
        loss_masks_all: torch.Tensor = data["loss_mask"]
        teacher_action_log_probs: torch.Tensor = data["base_action_log_probs"]
        action_log_probs: torch.Tensor = data["action_log_probs"]

        # Set rewards to the negative per-token reverse KL (the KL penalty).
        rewards = -(action_log_probs - teacher_action_log_probs) * loss_masks_all
        data["rewards"] = rewards
        return data
```
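As a quick sanity check of the tensor arithmetic, here is what the reward computation does on made-up values (toy shapes and numbers; this only mirrors the math above, not SkyRL's actual TrainingInputBatch contents):

```python
import torch

# Toy values, not real model outputs.
action_log_probs = torch.tensor([[-0.5, -2.0, -1.0]])          # student log-probs per token
teacher_action_log_probs = torch.tensor([[-0.6, -0.4, -1.0]])  # teacher log-probs per token
loss_mask = torch.tensor([[1.0, 1.0, 0.0]])                     # third position is masked out

rewards = -(action_log_probs - teacher_action_log_probs) * loss_mask
print(rewards)
# ~ [[-0.1, 1.6, 0.0]]: tokens the teacher likes more than the student get a
# positive reward, tokens it likes less get a negative one, and masked
# positions contribute nothing.
```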
Next, we modify the advantage estimator to be a no-op, simply setting the advantages and returns to the negative reverse KL that we computed in apply_reward_kl_penalty above. This is made easy by SkyRL's simple registry system for creating custom advantage estimators and policy losses. We just need to add the following code to our training script: