Philipp Moritz, Hao Chen, Tyler Griggs, and the SkyRL Team
🗓️ Posted: February 8, 2026
<aside>
We are happy to announce SkyRL tx v0.3.0!
SkyRL tx is a LoRA-native training and inference engine that implements the Tinker API and allows people to run a Tinker-like service on their own hardware.
In this release, we add expert-parallel support, DeepSeekV3 model support (e.g. for the GLM 4.7 Flash model), a number of optimizations for long sequence lengths, and several smaller features, performance optimizations, and bug fixes.
</aside>
<aside> 📢
We gave a talk on SkyRL tx as a unified training and inference engine at this year’s Ray Summit; check out the recording and slides.
</aside>
We now support expert-parallel (EP) sharding, which brings large performance improvements for MoE models: weights are sharded by expert, so each shard only processes the tokens routed to its experts, which reduces communication overhead. Each shard also holds larger individual matrices, since they no longer need to be split along the tensor-parallel dimension (though TP and EP sharding can be combined if desired); this makes the matrix multiplications more efficient. EP sharding was implemented in a series of PRs:
We implemented a naive version of `ragged_dot` that supports a `group_offset` parameter (see the upstream `jax.lax.ragged_dot` documentation). This was necessary since the parameter is not implemented upstream (see this issue). The `group_offset` parameter is used to evaluate only the subset of tokens assigned to the experts on each shard. While this naive version does extra work, integrating it already gives a good speedup: for the Qwen3-30B-A3B model, it reduced the step time from 110s with TP=8 to 40s with EP=8. This can be further reduced to 20s by using optimized kernels.

We added support for DeepSeekV3 models (#889); thanks Tanmay for the contribution! This is exciting because many modern models like GLM 4.7 Flash are based on the DeepSeekV3 architecture and can therefore be used now (#1023). Further down in this post you can find instructions on how to train the GLM 4.7 Flash model.
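To illustrate the `group_offset` workaround described above, here is a minimal NumPy sketch (hypothetical function name, not the engine’s actual code) that emulates a grouped matmul restricted to a shard’s local experts: the local expert weights are zero-padded to the full expert count, so tokens routed to remote experts simply hit zero weights. This does extra work, matching the naive version described in the post.

```python
import numpy as np

def ragged_dot_group_offset(lhs, rhs_local, group_sizes, group_offset):
    """Naive emulation of ragged_dot with a group_offset parameter.

    lhs:          [m, k]        tokens, grouped (sorted) by expert id
    rhs_local:    [g_loc, k, n] weights for this shard's local experts,
                                i.e. global experts
                                [group_offset, group_offset + g_loc)
    group_sizes:  [g_tot]       number of tokens per global expert
    """
    g_tot = group_sizes.shape[0]
    g_loc, k, n = rhs_local.shape
    # Zero-pad the local expert weights to the full expert count, so no
    # changes to the grouped-matmul kernel itself are needed.
    rhs_full = np.zeros((g_tot, k, n), dtype=rhs_local.dtype)
    rhs_full[group_offset:group_offset + g_loc] = rhs_local

    # Ordinary grouped (ragged) matmul over ALL tokens: extra work for
    # tokens belonging to remote experts, whose rows come out as zeros.
    out = np.zeros((lhs.shape[0], n), dtype=lhs.dtype)
    start = 0
    for g in range(g_tot):
        end = start + int(group_sizes[g])
        out[start:end] = lhs[start:end] @ rhs_full[g]
        start = end
    return out
```

An optimized kernel would instead skip the remote experts’ tokens entirely, which is where the further speedup mentioned above comes from.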
We added a number of significant optimizations for long sequence support:
The loss computation is chunked along the sequence dimension via `loss_chunk_size`, which is a configurable engine parameter.

Going forward, we are also planning to implement sequence/context parallelism to support even longer contexts (#1056).
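The idea behind chunking the loss can be sketched as follows (hypothetical function and parameter names other than `loss_chunk_size`; not the engine’s actual implementation): computing the logits chunk by chunk bounds peak memory at `[loss_chunk_size, vocab]` instead of `[seq, vocab]`, which matters for long sequences with large vocabularies.

```python
import numpy as np

def chunked_ce_loss(hidden, unembed, targets, loss_chunk_size):
    """Cross-entropy loss computed in chunks along the sequence, so the
    full [seq, vocab] logits matrix is never materialized at once.

    hidden:  [seq, d]    final hidden states
    unembed: [d, vocab]  output projection
    targets: [seq]       target token ids
    """
    seq = hidden.shape[0]
    total = 0.0
    for start in range(0, seq, loss_chunk_size):
        h = hidden[start:start + loss_chunk_size]   # [c, d]
        t = targets[start:start + loss_chunk_size]  # [c]
        logits = h @ unembed                        # [c, vocab]
        # numerically stable log-softmax
        z = logits - logits.max(axis=-1, keepdims=True)
        logp = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
        total += -logp[np.arange(len(t)), t].sum()
    return total / seq
```

The result is identical to the unchunked computation; only the peak memory changes.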
We implemented model unloading support (#844), which ensures models are properly cleaned up when there are no more heartbeats from the clients that instantiated them.
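As a rough illustration of the heartbeat mechanism (a hypothetical sketch, not the actual implementation in #844): each client periodically pings for its models, and a periodic sweep evicts models whose last heartbeat is older than a timeout.

```python
import time

class ModelRegistry:
    """Hypothetical heartbeat-based model eviction sketch."""

    def __init__(self, timeout_s=300.0):
        self.timeout_s = timeout_s
        self._last_beat = {}  # model_id -> last heartbeat timestamp

    def heartbeat(self, model_id):
        """Record that a client of this model is still alive."""
        self._last_beat[model_id] = time.monotonic()

    def sweep(self):
        """Unload models with no recent heartbeat; return evicted ids."""
        now = time.monotonic()
        stale = [m for m, t in self._last_beat.items()
                 if now - t > self.timeout_s]
        for m in stale:
            # A real engine would free the model weights here.
            del self._last_beat[m]
        return stale
```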
Improved Tinker API compatibility: we implemented top_p sampling (#830); thanks John for the contribution! We also updated the API to support the latest SDK (#837).
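For reference, top_p (nucleus) sampling keeps the smallest set of highest-probability tokens whose cumulative mass exceeds `top_p` and samples from that set. A minimal NumPy sketch of the idea (hypothetical function name; not the code from #830):

```python
import numpy as np

def top_p_sample(logits, top_p, rng):
    """Sample a token id using nucleus (top_p) filtering."""
    # softmax over the vocabulary (numerically stable)
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    # sort tokens by probability, descending
    order = np.argsort(probs)[::-1]
    csum = np.cumsum(probs[order])
    # keep tokens until cumulative mass first exceeds top_p
    cutoff = np.searchsorted(csum, top_p) + 1
    keep = order[:cutoff]
    # renormalize over the kept set and sample
    p = probs[keep] / probs[keep].sum()
    return int(keep[rng.choice(len(keep), p=p)])
```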
There are a number of PRs that make it possible to use the SkyRL Train backend with the Tinker API server (#871, #978, #1010, #999, #1046, #1047). This is part of the ongoing SkyRL Tinkerification effort, and we will soon release a fully functional version of this backend, supporting both Megatron and FSDP2. Going forward, we are planning to integrate the API server more natively with SkyRL, possibly with a common PyPI package for all SkyRL Tinker backends (SkyRL Train for PyTorch and tx for JAX). The goal of this restructuring is to expose one common UX and documentation for users (e.g. skyrl tinker --backend="jax" --backend-config="..."), while keeping the code as compact as possible so that developers only need to be aware of the part of the project they are interested in. As part of this effort, we will also invest much more in documentation going forward!
We also fixed a number of bugs:

- num_samples for the sampling endpoint was not passed through correctly; this is now fixed in #1015. Thanks Chirag for reporting and fixing this!
- We fixed sampling from the base model using the external inference engine (#1039).
- We fixed an issue affecting num_samples > 1 (#1042).
- We fixed 'database is locked' errors.
- We fixed an issue that occurred when calling the sample endpoint directly through curl rather than the Tinker SDK. Thanks Jared for fixing this!

There are a number of exciting in-flight PRs, like support for the full GLM 4.7 model (#989), support for mHC (#1008), support for running on a Ray cluster (#955), and support for the Olmo 3 model (#1043). Thanks to Tanmay, Han-Ju Chen, and Jiang for the contributions!
As always, we welcome more contributions!