Shiyi Cao$^{1}$$^{\dagger}$, Sumanth Hegde$^{2}$, Dacheng Li$^{1}$, Tyler Griggs$^{1}$, Shu Liu$^{1}$, Eric Tang$^{2}$, Jiayi Pan$^{1}$, Xingyao Wang$^{3}$, Akshay Malik$^{2}$, Graham Neubig$^{3,4}$, Kourosh Hakhamaneshi$^{2}$, Richard Liaw$^{2}$, Philipp Moritz$^{2}$, Matei Zaharia$^{1}$, Joseph E. Gonzalez$^{1}$, Ion Stoica$^{1}$

$^{1}$University of California, Berkeley

$^{2}$Anyscale

$^{3}$All Hands AI

$^{4}$Carnegie Mellon University

$^{\dagger}$Project Lead

*Core Contributor


SkyRL — Online RL Training for Real-World Long-Horizon Agents

Most existing RL frameworks are optimized for tasks that involve stateless interactions over short horizons, such as search-augmented reasoning or simple code execution. In contrast, real-world tasks, like those represented in SWE-Bench, benefit from long-horizon planning in stateful, dynamic environments. This presents new challenges in both infrastructure and training algorithms.

We introduce SkyRL, our RL training pipeline for multi-turn tool-use LLMs, optimized for long-horizon, real-environment tasks like SWE-Bench and built on top of VeRL and OpenHands.

With SkyRL, we achieve promising results on SWE-Bench-Verified across model lines using only around 300 training samples!

🧠 SkyRL-Agent-7B-v0 from OpenHands-7B-Agent: 11.0% → 14.6%

🐉 SkyRL-Agent-8B-v0 from Qwen3-8B (no thinking): 3.6% → 9.4%

🔍 SkyRL-Agent-14B-v0 from Qwen3-14B (thinking-enabled): 18.0% → 21.6%

SkyRL-v0 is just the beginning. Try it out on GitHub.



Figure 1: SkyRL builds on top of VeRL, inheriting its rich support for learning algorithms. SkyRL extends VeRL by introducing the agent layer: (1) Efficient asynchronous multi-turn rollouts, (2) Generic tool use, and (3) Generic and scalable environment execution.

Overview

Recent progress in reinforcement learning has enabled language models to become active agents. Open-source frameworks such as Search-R1 and **ToRL** (built on top of VeRL) have made impressive strides in this direction, enabling multi-turn RL with interleaved single-tool use such as search or code execution. These systems have laid the essential groundwork for tool-augmented reasoning. However, complicated real-world tasks such as SWE-Bench, WebDev, and web browsing require advanced agentic ability: the model needs to invoke multiple tools, write and run tests, react to environment feedback, and execute long-horizon plans.

Figure 2: An Example Workflow for Solving SWE-Bench Issues with CodeAct Agent.
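To make the workflow in Figure 2 concrete, here is a minimal, hypothetical sketch of a CodeAct-style turn loop on a SWE-Bench issue. The `llm`, `sandbox`, and method names are illustrative assumptions, not the actual SkyRL or OpenHands API:

```python
# Hypothetical sketch of a CodeAct-style agent loop on a SWE-Bench issue.
# The model alternates between emitting executable actions (shell commands,
# file edits, test runs) and reading the environment's feedback until it
# decides the issue is fixed or the turn budget runs out.

def solve_issue(llm, sandbox, issue_text: str, max_turns: int = 30) -> str:
    """Return the patch the agent ends up with (all names are illustrative)."""
    history = [{"role": "user", "content": issue_text}]
    for _ in range(max_turns):
        reply = llm.chat(history)                 # model output: reasoning + an action
        history.append({"role": "assistant", "content": reply})
        if "<finish>" in reply:                   # agent signals it is done
            break
        observation = sandbox.execute(reply)      # run the action in a stateful sandbox
        history.append({"role": "user", "content": observation})
    return sandbox.diff()                         # accumulated changes, e.g. `git diff`
```

Every turn depends on the state left behind by earlier turns (edited files, installed packages, test results), which is exactly what makes these rollouts long-horizon and stateful.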

While these more advanced agents mark an exciting evolution, running online reinforcement learning on them is highly challenging. First, the training framework must execute environments quickly and interleave rollout generation with environment interactions efficiently. Second, robust long-horizon training algorithms are needed for effective learning (not the focus of this blog). Altogether, this makes the problem orders of magnitude more complex than training prior tool-augmented reasoning LLMs.
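As one illustration of the infrastructure side, rollouts can be collected concurrently so that slow environment steps in one episode (sandbox startup, test execution) do not stall the rest of the batch. The sketch below is a hedged example using asyncio; `rollout_fn` and the concurrency cap are assumptions, not SkyRL's actual interface:

```python
import asyncio
from typing import Any, Awaitable, Callable

# Illustrative sketch of asynchronous rollout collection: many long-horizon
# episodes run concurrently, with a semaphore capping how many sandboxes are
# live at once. `rollout_fn` stands in for an async version of the turn loop
# sketched above, returning a finished trajectory and its reward.

async def collect_batch(
    rollout_fn: Callable[[Any], Awaitable[Any]],
    tasks: list,
    max_concurrency: int = 64,
) -> list:
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded(task):
        async with semaphore:
            return await rollout_fn(task)

    # asyncio interleaves LLM calls and environment steps across episodes.
    return await asyncio.gather(*(bounded(t) for t in tasks))
```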

To fill this gap, we introduce SkyRL, our RL training pipeline for multi-turn tool-use LLMs on long-horizon tasks in complex environments such as SWE-Bench, built on top of VeRL and OpenHands.

SkyRL features: