Charlie Ruan, Tyler Griggs, Etash Guha, Benjamin Feuer, Alexander Shaw, Atula Tejaswi, Negin Raoof, Richard Zhuang, Ryan Marten, Boxuan Li, and the SkyRL Team, Harbor Team, and OpenThoughts-Agent Team
Posted: February 17, 2025
<aside>
We're excited to release the official integration of SkyRL and Harbor, a standardized way to train terminal-use agents with reinforcement learning. With this integration, you can run RL on any set of verifiable terminal tasks with a single launch script.
TL;DR: SkyRL handles RL. Harbor handles agent execution, sandboxing, and verification. Together, they provide a turnkey stack for agentic RL on terminal tasks. Get started →
This post covers the integration of:
- SkyRL: https://github.com/NovaSky-AI/SkyRL
- Harbor: https://github.com/laude-institute/harbor
</aside>
If you've tried training a terminal-use agent, you know the pain. A working setup requires getting many things right simultaneously, from training and sampling infra to sandbox scaling to algorithmic correctness. The surface area is large, and the subtle interactions between them are where things break.
Environment management. Each RL rollout needs a fresh sandboxed container spun up, wired to the agent, and torn down cleanly. You need to support one or more sandbox providers (Daytona, Modal), deal with image build times, and manage container lifecycles across hundreds of concurrent rollouts.
Error handling at every layer. Containers flake. Verifiers time out. The model exceeds its context length. The sandbox provider rate-limits you. Each failure mode requires a different response, as sketched below: some should be retried, some masked from training, some handled differently depending on your recipe. Getting this wrong means stalled runs or noisy gradients.
Keeping training on-policy. Agent frameworks often silently break the assumptions RL training relies on: summarizing long histories when nearing context limits, stripping thinking tokens, or orchestrating sub-agents with their own context management. Any of these can make your training effectively off-policy without you realizing it.
The barrier to agentic RL is not just the RL algorithm, but also everything around it.
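To make the error-handling point concrete, here is a hypothetical sketch of the kind of failure-mode policy a training stack has to encode. None of these exception classes or function names come from SkyRL or Harbor; they only illustrate that different failures need different responses:

# Hypothetical illustration; these exception classes and the dispatch function
# are not part of SkyRL or Harbor.
class SandboxStartupError(Exception):
    """The container provider flaked or rate-limited us."""

class VerifierTimeout(Exception):
    """The verification script did not finish in time."""

class ContextLengthExceeded(Exception):
    """The model ran past its context window mid-rollout."""

def handle_rollout_failure(error: Exception) -> str:
    """Map each failure mode to a different response before it reaches the trainer."""
    if isinstance(error, (SandboxStartupError, VerifierTimeout)):
        return "retry"   # transient infrastructure issues: rerun the rollout
    if isinstance(error, ContextLengthExceeded):
        return "mask"    # keep the truncated trajectory out of the loss
    return "drop"        # unknown failures: exclude them from the batch entirely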
Harbor is a widely-adopted agent evaluation framework, built by the creators of Terminal-Bench. It abstracts away sandbox management, the agent loop, and rollout monitoring. A Harbor task is a simple directory:
task-dir/
├── instruction.md      # Natural language task description
├── environment/
│   └── Dockerfile      # Container image
└── tests/
    └── test.sh         # Verification script; writes the reward
Harbor handles the full trial lifecycle: spinning up the sandbox, running the agent, verifying the result, and tearing everything down. The same agent logic is used across SFT trace generation, evaluation, and RL training, so there is no drift between the agent you evaluate and the agent you train.
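In rough pseudocode, a single trial goes through a lifecycle like the following. The function and object names here (run_trial, sandbox_provider, run_verifier) are illustrative placeholders, not Harbor's actual API:

# Illustrative pseudocode; these names are placeholders, not Harbor's actual API.
async def run_trial(task_dir: str, agent, sandbox_provider) -> tuple[list[dict], float]:
    # 1. Spin up a fresh sandboxed container from the task's Dockerfile.
    sandbox = await sandbox_provider.start(dockerfile=f"{task_dir}/environment/Dockerfile")
    try:
        # 2. Run the agent against the sandbox until it finishes or hits a limit.
        chat_history = await agent.run(
            instruction=open(f"{task_dir}/instruction.md").read(),
            sandbox=sandbox,
        )
        # 3. Run the verification script inside the sandbox to obtain a reward.
        reward = await sandbox.run_verifier(f"{task_dir}/tests/test.sh")
    finally:
        # 4. Always tear the container down, even if something failed.
        await sandbox.teardown()
    return chat_history, reward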
SkyRL is a modular RL library for LLMs. Its architecture cleanly separates the Trainer (policy optimization) from the Generator (trajectory generation), connected by a minimal interface:
from abc import ABC, abstractmethod

class GeneratorInterface(ABC):
    @abstractmethod
    async def generate(self, input_batch: GeneratorInput) -> GeneratorOutput:
        ...
The Harbor integration implements this interface with a HarborGenerator that runs Harbor trials and converts the results (chat histories and rewards) into the tokenized format SkyRL's trainer expects. No changes to the core SkyRL training loop are needed:
SkyRL Training Loop (unchanged)
        │
        ▼
HarborGenerator (implements GeneratorInterface)
        │
        ▼
Harbor Trial → sandbox + agent + verifier → chat history & reward
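A minimal sketch of the adapter's shape, assuming hypothetical helpers run_harbor_trial and tokenize_chat, and assumed field names on GeneratorInput and GeneratorOutput (the classes come from SkyRL, but the fields shown here are illustrative, not its actual schema):

# Sketch only. run_harbor_trial() and tokenize_chat() are hypothetical helpers,
# and the .tasks / trajectories / rewards fields are assumptions about
# GeneratorInput / GeneratorOutput, not SkyRL's actual schema.
import asyncio

class HarborGenerator(GeneratorInterface):
    def __init__(self, tokenizer, harbor_config):
        self.tokenizer = tokenizer
        self.harbor_config = harbor_config

    async def generate(self, input_batch: GeneratorInput) -> GeneratorOutput:
        # Run one Harbor trial per task in the batch, concurrently.
        results = await asyncio.gather(
            *(run_harbor_trial(task, self.harbor_config) for task in input_batch.tasks)
        )
        # Convert each trial's chat history into the tokenized trajectory format
        # the SkyRL trainer expects, and attach the verifier's reward.
        trajectories = [tokenize_chat(self.tokenizer, chat) for chat, _ in results]
        rewards = [reward for _, reward in results]
        return GeneratorOutput(trajectories=trajectories, rewards=rewards)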
Harbor owns agent execution and sandboxing. SkyRL owns training. Neither needs to know the internals of the other.