Shu Liu$^{1}$$^{\dagger}$, Sumanth Hegde$^{2}$, Shiyi Cao$^{1}$, Alan Zhu$^{1}$, Dacheng Li$^{1}$, Tyler Griggs$^{1}$, Eric Tang$^{2}$, Akshay Malik$^{2}$, Kourosh Hakhamaneshi$^{2}$, Richard Liaw$^{2}$, Philipp Moritz$^{2}$, Matei Zaharia$^{1}$, Joseph E. Gonzalez$^{1}$, Ion Stoica$^{1}$

$^{1}$University of California, Berkeley

$^{2}$Anyscale

$^{\dagger}$Project Lead

*Core Contributor

🗓️ Posted: May 20, 2025

<aside> 💡

SkyRL-SQL: Simple, Efficient Multi-Turn RL for Text2SQL

In this post, we share early results on enabling multi-turn reinforcement learning (RL) for Text-to-SQL, where we teach an LLM to iteratively build and refine SQL queries using feedback from databases. We introduce a simple, data-efficient, and scalable multi-turn RL training pipeline for the Text-to-SQL task, built on top of VeRL and SearchR1.

🚀 Using just ~600 training datapoints, our model SkyRL-SQL-7B improves the base model's execution accuracy by up to 9.2% on 5 different Spider benchmarks, outperforming GPT-4o, o4-mini, and OmniSQL-7B (an open-source SFT model trained on 2.5 million samples).

Check out our data, model, and code for more details!

📊 Data: SkyRL-SQL-653-data

🐼 Model: SkyRL-SQL-7B

🐉 Training Code: https://github.com/NovaSky-AI/SkyRL

</aside>

🔍 Overview

Recent advances in RL have turned LLMs into interactive agents capable of reasoning, exploring, and executing actions. While tasks like math, web search, and kernel code generation have seen impressive progress, database interaction remains relatively under-explored.

In this post, we focus on Text-to-SQL: translating natural language questions into SQL queries. In real-world applications, these questions are often vague or incomplete. For example, users might say “latest review” without referencing an exact column, or expect joins across tables without stating them explicitly. Human analysts typically resolve such ambiguity by exploring the database step by step, a practice known as exploratory data analysis (EDA). In contrast, classical one-shot SQL generation by LLMs is often error-prone.

Inspired by exploratory data analysis, we train LLM agents to do the same: probe the database, then refine and verify the SQL query through database interactions. This turns query generation into a multi-turn decision process in which the model incrementally explores and self-corrects through trial and error.
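To make the interaction loop concrete, here is a minimal sketch of one such episode, assuming a SQLite database and two hypothetical helpers: `generate_turn` (one LLM call over the conversation so far) and `extract_sql` (parsing the proposed query and whether the model marked it final). It illustrates the idea rather than the released training code.

```python
# Minimal sketch of a multi-turn Text-to-SQL episode (illustrative only).
import sqlite3

MAX_TURNS = 5


def run_episode(question: str, db_path: str, generate_turn, extract_sql) -> str:
    """Let the model probe the database, observe feedback, and refine its SQL."""
    conn = sqlite3.connect(db_path)
    history = [{"role": "user", "content": question}]
    final_sql = ""
    for _ in range(MAX_TURNS):
        response = generate_turn(history)           # model proposes a query or a final answer
        history.append({"role": "assistant", "content": response})
        sql, is_final = extract_sql(response)       # hypothetical parser: SQL text + "is final?" flag
        try:
            rows = conn.execute(sql).fetchmany(10)  # run it against the real database
            feedback = f"Execution result (first rows): {rows}"
        except sqlite3.Error as e:
            feedback = f"Execution error: {e}"      # errors become observations, not crashes
        if is_final:
            final_sql = sql
            break
        history.append({"role": "user", "content": feedback})
    conn.close()
    return final_sql
```

Execution errors are surfaced back to the model as observations rather than terminating the rollout, which is what gives it room to self-correct.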

In this post, we share early results from training a Text-to-SQL agent using RL in a database environment, with a focus on reducing execution errors—ensuring that generated SQL queries can run without failure.

Figure 1. Workflow on Text2SQL tasks with database interactions.

🔑 Key Takeaways

<aside> 💡

  1. Multi-turn RL learns faster (in terms of training steps) and generalizes better than single-turn training, even without environment feedback at test time.
  2. Small data and simple rewards suffice: using just 0.03% of the data, our model beats SFT trained on 2.5M samples and RL with complex partial rewards (see the reward sketch after this list).
  3. The model learns to reflect, verify, and debug SQL through database feedback, mimicking the behavior of human analysts.
  4. Failure patterns reveal insights: overconfidence and looping behaviors highlight areas for improving RL in database settings.

</aside>
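As a rough illustration of the “simple rewards” point in takeaway 2, the sketch below shows a binary execution-match reward: the predicted query earns credit only if it executes and returns the same result set as the ground-truth query. The exact reward used in SkyRL-SQL may differ, and `execution_match_reward` is a hypothetical name.

```python
# Illustrative sketch of a simple outcome reward (not the released SkyRL-SQL code):
# 1.0 if the predicted query runs and its result matches the ground-truth query's
# result, 0.0 otherwise. No partial credit is given.
import sqlite3


def execution_match_reward(pred_sql: str, gold_sql: str, db_path: str) -> float:
    conn = sqlite3.connect(db_path)
    try:
        pred_rows = conn.execute(pred_sql).fetchall()
        gold_rows = conn.execute(gold_sql).fetchall()
    except sqlite3.Error:
        return 0.0  # any execution failure yields zero reward
    finally:
        conn.close()
    # Compare results as order-insensitive multisets of rows.
    return 1.0 if sorted(map(repr, pred_rows)) == sorted(map(repr, gold_rows)) else 0.0
```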

🧠 Multi-Turn SQL Interactions

Just as human analysts explore unfamiliar databases, in our multi-turn SQL setup the model is allowed to