Philipp Moritz, Tyler Griggs, and the SkyRL Team

๐Ÿ—“๏ธ Posted: November 3, 2025

<aside>

We are happy to announce SkyRL tx v0.1.0!

SkyRL tx is a unified training and inference engine that implements the Tinker API and allows people to set up their own Tinker-like service running on their own hardware.

This is the first version that supports RL end-to-end, and sampling is now significantly faster.

</aside>
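Concretely, implementing the Tinker API means an unmodified Tinker client can talk to a SkyRL tx server. Below is a minimal sketch of that connection, assuming the tinker package's ServiceClient and create_lora_training_client interface as used in the tinker-cookbook; the URL and model mirror the setup at the end of this post, and exact signatures may vary between versions.

```python
import os

import tinker

# Point the Tinker client at a self-hosted SkyRL tx server instead of the
# hosted service. The dummy API key and base_url mirror the RL recipe below.
os.environ["TINKER_API_KEY"] = "dummy"
service_client = tinker.ServiceClient(base_url="http://localhost:8000")

# Create a LoRA training client for the base model the server was started
# with (parameter names follow the tinker-cookbook and may differ by version).
training_client = service_client.create_lora_training_client(
    base_model="Qwen/Qwen3-4B", rank=1
)
```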

<aside> 👋

We are giving a talk, "SkyRL tx: A unified training and inference engine", on November 4 at 4pm at this year's Ray Summit. If you are around, come say hi!

</aside>

Updates

There are a number of PRs currently in flight that will be part of the next release, including support for an external inference engine (#568), LoRA support for embeddings (#511), support for prompt logprobs (#577), full-parameter optimization (#611), and coverage of more sampling scenarios (#613). Thanks to Guido, Lucas, and Atem for the contributions!

As always, we welcome more contributions! We are currently falling a little behind on extending the documentation, so if you are excited about contributing there, that would be particularly welcome. Also welcome: implementing more functionality of the Tinker API, performance optimizations, any of the currently open tasks here, or really anything you would like to see implemented.

Running RL end-to-end

Since this is the first release that supports RL end-to-end, we conclude the blog post with instructions for running it on 8xH100 GPUs. First clone https://github.com/NovaSky-AI/SkyRL and, in the skyrl-tx folder, start the engine with

```bash
uv run --extra gpu --extra tinker -m tx.tinker.api --base-model Qwen/Qwen3-4B --max-lora-adapters 3 --max-lora-rank 1 --tensor-parallel-size 8 --train-micro-batch-size 8 > out.log
```
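The engine redirects its logs to out.log, which you can follow to watch the weights load and requests come in. If you are scripting the setup, here is a small optional sketch that blocks until the server accepts connections; the port matches the base_url used below, and this only checks the TCP socket, not whether the model has finished loading.

```python
import socket
import time

# Poll until the SkyRL tx engine started above accepts TCP connections on
# port 8000 (the port that base_url below points at). This does not verify
# that the model weights have finished loading; check out.log for that.
while True:
    try:
        socket.create_connection(("localhost", 8000), timeout=1).close()
        break
    except OSError:
        time.sleep(2)
print("SkyRL tx server is accepting connections")
```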

Then, clone https://github.com/thinking-machines-lab/tinker-cookbook and, in the tinker_cookbook/recipes folder, run

```bash
export TINKER_API_KEY=dummy
export WANDB_API_KEY=<your key>
uv run --with wandb --with tinker rl_loop.py base_url=http://localhost:8000 model_name="Qwen/Qwen3-4B" lora_rank=1 max_length=1024 save_every=100
```
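Note that the client arguments have to line up with the server flags above: model_name should match --base-model, and lora_rank presumably cannot exceed the server's --max-lora-rank.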

You should get a reward curve like the following: