Charlie Ruan, Tyler Griggs, Etash Guha, Benjamin Feuer, Alexander Shaw, Atula Tejaswi, Negin Raoof, Richard Zhuang, Ryan Marten, Boxuan Li, and the SkyRL Team, Harbor Team, and OpenThoughts-Agent Team

πŸ—“οΈ Posted: February 17, 2025

<aside>

We're excited to release the official integration of SkyRL and Harbor, a standardized way to train terminal-use agents with reinforcement learning. With this integration, you can run RL on any set of verifiable terminal tasks with a single launch script.

TL;DR: SkyRL handles RL. Harbor handles agent execution, sandboxing, and verification. Together, they provide a turnkey stack for agentic RL on terminal tasks. Get started →

This post covers:

  1. Why training terminal-use agents with RL is hard
  2. How SkyRL and Harbor fit together
  3. The breadth of tasks you can train on with a single integration

https://github.com/NovaSky-AI/SkyRL

https://github.com/laude-institute/harbor

</aside>


1. Why RL for Terminal-Use is Hard

If you've tried training a terminal-use agent, you know the pain. A working setup requires getting many things right simultaneously, from training and sampling infra to sandbox scaling to algorithmic correctness. The surface area is large, and the subtle interactions between these pieces are where things break.

Environment management. Each RL rollout needs a fresh sandboxed container spun up, wired to the agent, and torn down cleanly. You need to support one or more sandbox providers (Daytona, Modal), deal with image build times, and manage container lifecycles across hundreds of concurrent rollouts.
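
To make the lifecycle concrete, here is a minimal sketch of the bookkeeping involved, assuming a hypothetical sandbox provider with create/destroy methods (not Harbor's or any specific provider's actual API):

import asyncio

MAX_CONCURRENT_SANDBOXES = 128  # providers rate-limit, so cap in-flight containers

sandbox_slots = asyncio.Semaphore(MAX_CONCURRENT_SANDBOXES)

async def run_rollout(task, provider, agent):
    """Spin up a fresh container, run one rollout, and always tear it down."""
    async with sandbox_slots:
        sandbox = await provider.create(image=task.image)  # hypothetical provider call
        try:
            return await agent.run(task, sandbox)          # wire the agent to the sandbox
        finally:
            await provider.destroy(sandbox)                # never leak containers

async def run_batch(tasks, provider, agent):
    # Hundreds of rollouts in flight, each with its own container lifecycle.
    return await asyncio.gather(
        *(run_rollout(t, provider, agent) for t in tasks),
        return_exceptions=True,  # one flaky container should not kill the batch
    )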

Error handling at every layer. Containers flake. Verifiers time out. The model exceeds its context length. The sandbox provider rate-limits you. Each failure mode requires a different response: some should be retried, some masked from training, some handled differently depending on your recipe. Getting this wrong means stalled runs or noisy gradients.
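
A rough sketch of that dispatch, with hypothetical error types (not SkyRL's or Harbor's real exception classes), looks something like this:

# Hypothetical failure types; a real stack maps provider/agent/verifier errors
# onto a small set of policies like these.
class SandboxProvisionError(Exception): ...
class ProviderRateLimit(Exception): ...
class VerifierTimeout(Exception): ...
class ContextLengthExceeded(Exception): ...

def failure_policy(error: Exception) -> str:
    """Decide whether a failed rollout is retried, masked, or scored as a failure."""
    if isinstance(error, (SandboxProvisionError, ProviderRateLimit)):
        return "retry"        # infra flake or rate limit: rerun, ideally with backoff
    if isinstance(error, ContextLengthExceeded):
        return "mask"         # exclude from the loss so it does not bias gradients
    if isinstance(error, VerifierTimeout):
        return "zero_reward"  # keep the trajectory, treat the task as unsolved
    raise error               # unknown failures should surface loudly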

Keeping training on-policy. Agent frameworks often silently break the assumptions RL training relies on: summarizing long histories when nearing context limits, stripping thinking tokens, or orchestrating sub-agents with their own context management. Any of these can make your training effectively off-policy without you realizing it.

The barrier to agentic RL is not just the RL algorithm, but also everything around it.


2. SkyRL + Harbor: The Integration

Harbor is a widely adopted agent evaluation framework, built by the creators of Terminal-Bench. It abstracts away sandbox management, the agent loop, and rollout monitoring. A Harbor task is a simple directory:

task-dir/
  instruction.md    # Natural language task description
  environment/
    Dockerfile      # Container image
  tests/
    test.sh         # Verification script → writes reward

Harbor handles the full trial lifecycle: spinning up the sandbox, running the agent, verifying the result, and tearing everything down. The same agent logic is used across SFT trace generation, evaluation, and RL training, so there is no drift between the agent you evaluate and the agent you train.

SkyRL is a modular RL library for LLMs. Its architecture cleanly separates the Trainer (policy optimization) from the Generator (trajectory generation), connected by a minimal interface:

from abc import ABC, abstractmethod

# GeneratorInput / GeneratorOutput are SkyRL's trajectory-generation batch types.
class GeneratorInterface(ABC):
    @abstractmethod
    async def generate(self, input_batch: GeneratorInput) -> GeneratorOutput:
        ...

The Harbor integration implements this interface with a HarborGenerator that runs Harbor trials and converts the results (chat histories and rewards) into the tokenized format SkyRL's trainer expects. No changes to the core SkyRL training loop are needed:

SkyRL Training Loop (unchanged)
       │
       ▼
HarborGenerator (implements GeneratorInterface)
       │
       ▼
Harbor Trial  →  sandbox + agent + verifier  →  chat history & reward

Harbor owns agent execution and sandboxing. SkyRL owns training. Neither needs to know the internals of the other.
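
For a concrete picture of that boundary, here is a minimal sketch of a generator in this shape, written against the GeneratorInterface above. The Harbor-side call (run_trial), the trial fields, the tokenizer handling, and the GeneratorInput/GeneratorOutput fields are illustrative placeholders, not the real APIs:

import asyncio

class HarborGenerator(GeneratorInterface):
    """Sketch only: bridges Harbor trials to SkyRL's generator interface."""

    def __init__(self, harbor_client, tokenizer):
        self.harbor = harbor_client  # placeholder for Harbor's trial runner
        self.tokenizer = tokenizer

    async def generate(self, input_batch: GeneratorInput) -> GeneratorOutput:
        # Run one Harbor trial per task in the batch, concurrently.
        trials = await asyncio.gather(
            *(self.harbor.run_trial(task) for task in input_batch.tasks)
        )
        # Convert each trial's chat history and reward into the token-level
        # format the trainer consumes.
        sequences, rewards = [], []
        for trial in trials:
            sequences.append(self.tokenizer.apply_chat_template(trial.chat_history))
            rewards.append(trial.reward)
        return GeneratorOutput(sequences=sequences, rewards=rewards)

The important property is the direction of the dependency: the generator adapts Harbor's outputs to the trainer's format, and the training loop never touches sandbox or agent internals.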