Tyler Griggs$^{1}$$^{\dagger}$, Sumanth Hegde$^{2}$, Eric Tang$^{2}$, Shu Liu$^{1}$, Shiyi Cao$^{1}$, Dacheng Li$^{1}$, Charlie Ruan$^{1}$*, Philipp Moritz$^{2}$, Kourosh Hakhamaneshi$^{2}$, Richard Liaw$^{2}$, Akshay Malik$^{2}$, Matei Zaharia$^{1}$, Joseph E. Gonzalez$^{1}$, Ion Stoica$^{1}$
$^{1}$University of California, Berkeley
$^{2}$Anyscale
$^{\dagger}$Project Lead
*Core Contributor
🗓️ Posted: June 26, 2025
<aside> 🏋️
In the original release of SkyRL, we introduced an agentic layer in the RL stack for multi-turn, tool-use LLMs, optimized for long-horizon, real-environment tasks like SWE-Bench.
Today, we are upgrading SkyRL into a highly modular RL framework for training LLMs, with two key additions:
1. A modular, performant RL framework for training LLMs. SkyRL makes it easy to prototype new training algorithms, environments, and training execution plans, without compromising usability or speed.
2. A gymnasium of tool-use tasks with a simple environment interface and an **out-of-the-box library of popular tasks** such as math, code, search, and SQL.

SkyRL's modularity enables easy implementation of real-world improvements, like async training, heterogeneous hardware, and new environments, in under 100 LoC and with up to 1.8× faster training.
Try it out at SkyRL on GitHub.
From the NovaSkyAI team at UC Berkeley Sky Computing Lab.
</aside>
Post-training is unlocking new capabilities in LLM development. Reinforcement learning (RL) is arguably the most complex post-training workload: it involves producing a large volume of model responses (inference engines), executing the actions those responses describe for a given task (environments), and updating model parameters based on the resulting rewards (trainers). The requirements for each of these components are diverse and, as the community rapidly explores and develops new RL techniques, those requirements continue to evolve.
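To make the three stages concrete, the sketch below shows one step of a generic RL post-training loop. The object and method names (`inference_engine.generate`, `environment.step`, `trainer.update`, and so on) are schematic placeholders for this illustration, not the API of SkyRL or any particular framework.

```python
# Schematic only: one step of an RL post-training loop, showing the three
# stages described above. All names here (inference_engine, environment,
# trainer) are placeholders for illustration, not SkyRL's actual API.

def rl_training_step(inference_engine, environment, trainer, prompts):
    # 1. Inference: produce a batch of model responses for the current prompts.
    responses = inference_engine.generate(prompts)

    # 2. Environment: execute the action each response describes
    #    (e.g., a tool call or code edit) and observe a reward.
    rewards = [environment.step(response) for response in responses]

    # 3. Training: update model parameters from the observed rewards.
    trainer.update(prompts, responses, rewards)

    # Keep the inference engine serving the newly updated weights.
    inference_engine.sync_weights(trainer.get_weights())
```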
We found that many existing RL frameworks tightly couple core components and lack clear interfaces for implementing custom logic. As a result, prototyping new ideas is often cumbersome—changing one part of the stack typically requires modifying several others. This rigidity makes it hard for developers to adapt to the evolving requirements of RL and to freely explore innovations across the stack, whether in algorithms, execution plans, or environments.
Based on this insight, modularity became our top design priority: we built SkyRL-v0.1 to be highly modular, so that each layer of the RL framework can be flexibly modified and extended.
Figure 1. High-level architecture of SkyRL-v0.1’s core components.
SkyRL-v0.1 breaks the RL stack into modular components and provides well-defined public APIs for each of them. Specifically, as shown in Figure 1 above, training is separated into two major components, Trainer and Generator; the Generator is further divided into an InferenceEngine and an Environment; and a single Controller manages the setup and execution of each component, as sketched below.
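To make these boundaries concrete, here is a minimal sketch of how the components could fit together. The class names mirror the components in Figure 1, but the method names and signatures are simplified placeholders for illustration, not SkyRL's actual public API.

```python
# Illustrative sketch of the component boundaries from Figure 1.
# Class names mirror SkyRL's components; method names and signatures
# are simplified placeholders, not the actual public API.
from abc import ABC, abstractmethod

class InferenceEngine(ABC):
    @abstractmethod
    def generate(self, prompts):
        """Produce model responses for a batch of prompts."""

class Environment(ABC):
    @abstractmethod
    def step(self, action):
        """Execute an action (e.g., a tool call) and return an observation and reward."""

class Generator(ABC):
    """Pairs an InferenceEngine with an Environment to produce trajectories and rewards."""
    @abstractmethod
    def generate_trajectories(self, prompts):
        """Roll out the model in the environment and return rewarded trajectories."""

class Trainer(ABC):
    @abstractmethod
    def train_step(self, trajectories):
        """Update model parameters from a batch of trajectories and rewards."""

class Controller:
    """Sets up each component and drives the overall training loop."""
    def __init__(self, trainer: Trainer, generator: Generator):
        self.trainer = trainer
        self.generator = generator

    def run(self, prompt_batches):
        for prompts in prompt_batches:
            trajectories = self.generator.generate_trajectories(prompts)
            self.trainer.train_step(trajectories)
```

The components' responsibilities are as follows: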