SQL Query Generation Environment

Train LLMs to generate correct SQL queries from natural language questions.

Overview

This environment uses the Salesforce/WikiSQL dataset to train language models on text-to-SQL tasks. Queries are verified by executing the generated SQL against in-memory SQLite databases and comparing results to ground truth.

Dataset

Source: Salesforce/WikiSQL
Size: 80,654 examples (train + validation + test)
Format: Natural language questions with table schemas and ground truth SQL

Usage

Training Mode (with API Server)

# Terminal 1: Start the Atropos API
run-api

# Terminal 2: Run the environment
python sql_query_env.py serve --slurm False

Local Testing (without API)

python sql_query_env.py process --env.data_path_to_save_groups sql_output.jsonl

This generates sql_output.jsonl and sql_output.html for inspection.

With Local vLLM Server

python sql_query_env.py process \
    --env.data_path_to_save_groups sql_output.jsonl \
    --openai.base_url http://localhost:9001/v1 \
    --openai.model_name YOUR_MODEL_NAME

Reward Function

Score	Condition
1.0	Generated SQL executes and returns same result as gold SQL
-1.0	SQL fails to execute or returns incorrect result

When all responses in a group are correct, a length penalty is applied to encourage concise solutions.

Prompt Format

The model receives a table schema and question:

Table: data
Columns: col1, col2, col3
Sample data:
  value1 | value2 | value3

Question: What is the value of col1 where col2 equals X?

Output should be in boxed format:

<think>
[Chain of thought reasoning]
</think>

\boxed{SELECT col1 FROM data WHERE col2 = 'X'}

Unit Tests

# Run unit tests
python -m pytest test_sql_executor.py -v

All 19 tests cover:

Table creation with special column names
SQL execution and error handling
\boxed{} extraction patterns
Result comparison and normalization
End-to-end scoring integration

LLM Integration Test

The environment has been verified with Qwen3-8B on an NVIDIA H200:

# Run integration test with a local vLLM server
python test_integration.py --base_url http://localhost:8000/v1 --model Qwen/Qwen3-8B

Test results:

40% accuracy on 10 random WikiSQL examples
SQL extraction from \boxed{} working correctly
Execution-based scoring producing correct reward signals

Files

File	Description
`sql_query_env.py`	Main environment implementation
`sql_executor.py`	SQLite execution and scoring utilities
`wikisql_loader.py`	WikiSQL dataset loader (from GitHub)
`test_sql_executor.py`	Unit tests (19 tests)
`test_integration.py`	LLM integration test

Author

Community contribution to Atropos.