Philipp Moritz, Tyler Griggs, and the SkyRL Team

🗓️ Posted: October 6, 2025

Overview

Recently, Thinking Machines announced Tinker, a REST-based API for neural network forward and backward passes that unifies inference and training in a single interface. It abstracts away the infrastructure challenges of managing GPUs and lets users focus purely on the machine learning aspects of their problems. Through the use of LoRA, it also allows many users to share GPUs efficiently, bringing down the cost of post-training (and online learning!) and making it accessible to everyone.
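To make the shape of such an API concrete, here is a rough sketch of the kind of loop it enables. The primitives (`forward_backward`, `optim_step`, `sample`) follow those described in the Tinker announcement, but the client construction and signatures below are illustrative, not the actual Tinker client:

```python
# Illustrative sketch only: the method names follow the primitives described
# in the Tinker announcement, but the client and signatures here are made up.
client = create_lora_training_client(base_model="Qwen/Qwen3-8B", rank=32)

for batch in dataloader:
    # Forward and backward passes run on remote, managed GPUs; the user only
    # deals with data and hyperparameters, never with GPU infrastructure.
    client.forward_backward(batch, loss_fn="cross_entropy")
    client.optim_step(learning_rate=1e-4)

# The same client can also sample from the current weights, e.g. for
# evaluation or for generating rollouts in an RL loop.
completions = client.sample(prompts, max_tokens=256)
```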

From a systems perspective, it also radically changes how to think about post-training systems and enables viewing them as inference systems that also support backward passes (i.e., computing gradients). This is analogous to how neural network libraries like PyTorch or JAX let you write only the forward pass and provide the backward pass automatically through automatic differentiation.
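The analogy is easy to see in code. In JAX, for example, you write only the forward computation and the library derives the gradients:

```python
import jax
import jax.numpy as jnp

# Only the forward pass is written by hand...
def loss(w, x, y):
    pred = x @ w
    return jnp.mean((pred - y) ** 2)

# ...and the backward pass is derived automatically.
grad_fn = jax.grad(loss)

w = jnp.zeros(3)
x = jnp.array([[1.0, 2.0, 3.0]])
y = jnp.array([1.0])
print(grad_fn(w, x, y))  # gradient of the loss with respect to w
```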

Some advantages of this approach:

  1. Unify inference and training in one common engine. A unified interface for training and inference opens the possibility of building a unified engine for both, which can help remove numerical differences between training and inference (see also Thinking Machines' Defeating Nondeterminism in LLM Inference), reduces the complexity of maintaining two independent software stacks, and eliminates the need for expensive checkpoint transfers between them.
  2. Allow for seamless online learning and continuous adaptation. With a single API for both forward and backward passes, online learning methods can be implemented easily: a model deployed behind this API generates training data while serving inference, and that data can immediately be used to update the model's parameters through the API's training functions (see the sketch after this list).
  3. Enable cost-effective multi-tenancy. By leveraging techniques like LoRA, a single base model can serve thousands of users, each with their own efficiently trained adapter. The simplified API and reduced infrastructure costs lower the barrier to entry, allowing more researchers, developers, and smaller companies to train and adapt powerful models.
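As a concrete illustration of point 2, an online learning loop against such an API could look roughly like the following. All names here (`client`, `get_next_user_request`, `collect_feedback`) are hypothetical stand-ins, not part of the actual Tinker API:

```python
# Hypothetical sketch: an inference deployment that learns from its own traffic.
while True:
    prompt = get_next_user_request()        # hypothetical application hook
    completion = client.sample(prompt)      # serve the request as usual
    reward = collect_feedback(completion)   # e.g. a user rating or reward model

    # The served interaction immediately becomes training data for the same
    # model, via the same API that produced the completion.
    client.forward_backward([(prompt, completion, reward)],
                            loss_fn="policy_gradient")
    client.optim_step(learning_rate=1e-5)
```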

We think the Tinker API will have a big impact on how people think about post-training and serving systems, and that it will be very useful for machine learning practitioners. To encourage more people in the open source community to think about and experiment with systems like Tinker, we are releasing SkyRL tx, an open source library that implements a backend for the Tinker API and lets anyone set up their own Tinker-like service on their own hardware.

<aside> ⚠️

This is an early release and still experimental. SkyRL tx works end-to-end and can already be used to train models, but there is still a lot of work left to do, so we want to share it early and invite the community to try it out, give us feedback, and contribute.

</aside>

Architecture

You can think of SkyRL tx as an inference engine that also supports backward passes. The system consists of four components:

  1. REST API server: Processes incoming requests from different users
  2. Database: Keeps track of metadata about models, checkpoints, requests, and futures, and also serves as a job queue for work that needs to be done. It currently uses SQLite, but sits behind an interface that supports other SQL databases such as Postgres.
  3. Engine: Responsible for scheduling and batching requests across users. Each engine supports a single base model but can serve many LoRA adapters (see the sketch after this list).
  4. Worker: Executes the forward and backward passes. It contains the model definitions and optimizer states. Going forward, using multiple workers will enable more sophisticated multi-node model sharding.
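To give a flavor of the multi-adapter batching mentioned in the engine component above, here is a minimal JAX sketch of how one batched forward pass through shared base weights can route each request through its own LoRA adapter. The shapes and names are illustrative of the technique, not SkyRL tx's actual implementation:

```python
import jax.numpy as jnp

def multi_lora_matmul(x, W, A, B, adapter_ids):
    """One shared base matmul plus a per-row low-rank correction.

    x:  [batch, d_in]          activations, one row per request
    W:  [d_in, d_out]          frozen base weights, shared by all users
    A:  [n_adapters, d_in, r]  per-user LoRA "down" projections
    B:  [n_adapters, r, d_out] per-user LoRA "up" projections
    adapter_ids: [batch]       which adapter each request belongs to
    """
    base = x @ W                                       # shared across all users
    A_sel, B_sel = A[adapter_ids], B[adapter_ids]      # gather each row's adapter
    low_rank = jnp.einsum("bi,bir->br", x, A_sel)      # x @ A_i, per row
    delta = jnp.einsum("br,bro->bo", low_rank, B_sel)  # (x @ A_i) @ B_i, per row
    return base + delta
```

Because the expensive base matmul is shared, requests from many different users can be batched together even though each one receives the output of its own adapter.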

[Figure: SkyRL tx architecture diagram]