Building a Robust Text-to-SQL System with Claude: A Practical Guide
Text-to-SQL is one of the most impactful applications of large language models. It bridges the gap between non-technical users and databases, allowing anyone to ask questions in plain English and get structured data back. Claude excels at this task because of its strong reasoning abilities, large context window, and ability to follow complex instructions.
In this guide, you'll learn how to build a production-ready Text-to-SQL system using Claude. We'll start with a basic prompt, then progressively improve it with few-shot examples, chain-of-thought reasoning, Retrieval Augmented Generation (RAG) for large schemas, and a self-improvement loop that lets Claude fix its own mistakes.
Why Text-to-SQL Matters
Before diving into the code, let's understand why this capability is so valuable:
- Democratizes data access: Non-technical team members can query databases without knowing SQL syntax.
- Boosts analyst productivity: Data analysts can prototype complex queries in seconds rather than minutes.
- Enables conversational interfaces: Chatbots and internal tools can answer data questions naturally.
- Handles complexity: LLMs can generate multi-join queries, subqueries, and aggregations that would take humans significant time to write correctly.
Setting Up Your Environment
First, let's create a test SQLite database with two tables: employees and departments. This will be our sandbox throughout the guide.
```python
import sqlite3

# Create an in-memory database
conn = sqlite3.connect(":memory:")
cursor = conn.cursor()

# Create departments table
cursor.execute('''
CREATE TABLE departments (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    location TEXT
)
''')

# Create employees table
cursor.execute('''
CREATE TABLE employees (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    department_id INTEGER,
    salary REAL,
    hire_date DATE,
    FOREIGN KEY (department_id) REFERENCES departments(id)
)
''')

# Insert sample data
cursor.executemany('INSERT INTO departments VALUES (?, ?, ?)', [
    (1, 'Engineering', 'New York'),
    (2, 'Sales', 'San Francisco'),
    (3, 'Marketing', 'Chicago')
])
cursor.executemany('INSERT INTO employees VALUES (?, ?, ?, ?, ?)', [
    (1, 'Alice Johnson', 1, 95000, '2020-03-15'),
    (2, 'Bob Smith', 1, 85000, '2021-06-01'),
    (3, 'Charlie Brown', 2, 72000, '2019-11-20'),
    (4, 'Diana Prince', 3, 68000, '2022-01-10'),
    (5, 'Eve Davis', 2, 78000, '2020-09-05')
])
conn.commit()
```
Step 1: The Basic Text-to-SQL Prompt
A good Text-to-SQL prompt needs three things: clear instructions, the user's query, and the database schema. Here's a minimal implementation:
```python
import anthropic

client = anthropic.Anthropic()

def get_schema(conn):
    cursor = conn.cursor()
    cursor.execute("SELECT sql FROM sqlite_master WHERE type='table'")
    return "\n".join(row[0] for row in cursor.fetchall())

def generate_sql(user_query, schema):
    prompt = f"""You are a SQL expert. Convert the following natural language query into a SQL query.

Database Schema:
{schema}

User Query: {user_query}

Generate only the SQL query, no explanation."""
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=500,
        messages=[{"role": "user", "content": prompt}]
    )
    return response.content[0].text

# Test it
schema = get_schema(conn)
query = "Show me all employees in the Engineering department"
sql = generate_sql(query, schema)
print(sql)
# Output: SELECT * FROM employees WHERE department_id = 1;
```
This works, but it's fragile. For complex queries, you'll get inconsistent results.
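One common failure mode: even with "Generate only the SQL query" in the prompt, models sometimes wrap the query in markdown code fences. A small post-processing helper makes the pipeline more robust (a sketch; `extract_sql` is a hypothetical helper, not a library function):

```python
import re

def extract_sql(response_text):
    """Strip markdown code fences that models sometimes wrap around SQL."""
    match = re.search(r"```(?:sql)?\s*(.*?)```", response_text, re.DOTALL)
    if match:
        return match.group(1).strip()
    return response_text.strip()
```

Run the model's raw output through this before executing it, so both fenced and unfenced responses behave the same.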
Step 2: Improving with Few-Shot Examples
Adding examples (few-shot learning) dramatically improves accuracy. Claude learns the pattern from your examples:
```python
def generate_prompt_with_examples(user_query, schema):
    examples = """
Examples:

User: List all departments
SQL: SELECT * FROM departments;

User: Show employees hired in 2021 or later
SQL: SELECT * FROM employees WHERE hire_date >= '2021-01-01';

User: What is the average salary by department?
SQL: SELECT d.name, AVG(e.salary) AS avg_salary
FROM employees e
JOIN departments d ON e.department_id = d.id
GROUP BY d.name;
"""
    prompt = f"""You are a SQL expert. Convert natural language to SQL.

{schema}

{examples}

User: {user_query}
SQL:"""
    return prompt
```
Pro tip: Include a few rows of actual data in your prompt. This gives Claude concrete context about values and formats, reducing hallucination.
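One way to act on this tip is a small helper that samples rows and formats them for the prompt. This is a sketch (`get_sample_rows` is a hypothetical name; the demo below uses a throwaway connection, but in the guide's setup you would pass the existing `conn`):

```python
import sqlite3

def get_sample_rows(conn, table_name, limit=3):
    """Fetch a few real rows and format them as a compact text table."""
    cursor = conn.cursor()
    cursor.execute(f"SELECT * FROM {table_name} LIMIT {limit}")
    columns = [desc[0] for desc in cursor.description]
    lines = [" | ".join(columns)]
    for row in cursor.fetchall():
        lines.append(" | ".join(str(value) for value in row))
    return "\n".join(lines)

# Demo on a throwaway database
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE employees (id INTEGER, name TEXT, salary REAL)")
demo_conn.execute("INSERT INTO employees VALUES (1, 'Alice Johnson', 95000.0)")
sample = get_sample_rows(demo_conn, "employees")
```

Append the returned snippet under each table's schema in the prompt so Claude sees real date formats, name casing, and numeric scales.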
Step 3: Chain-of-Thought Prompting
For complex queries, chain-of-thought (CoT) prompting helps Claude reason step-by-step. Use XML tags to structure the reasoning:
```python
def generate_sql_cot(user_query, schema):
    prompt = f"""You are a SQL expert. Convert the following query to SQL.

<schema>
{schema}
</schema>

<task>
{user_query}
</task>

First, think through the problem step by step inside <reasoning> tags:
- Identify the tables needed
- Determine the columns to select
- Identify any joins, filters, or aggregations
- Write the SQL

Then output the final SQL inside <sql> tags."""
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=1000,
        messages=[{"role": "user", "content": prompt}]
    )
    return response.content[0].text
```
This approach is especially useful for queries involving multiple joins, subqueries, or complex WHERE clauses.
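Since the CoT response now contains reasoning as well as SQL, you need to pull the query out of the `<sql>` tags before executing it. A minimal sketch (`extract_tagged_sql` is a hypothetical helper):

```python
import re

def extract_tagged_sql(response_text):
    """Pull the final query out of the <sql> tags the CoT prompt requests."""
    match = re.search(r"<sql>(.*?)</sql>", response_text, re.DOTALL)
    return match.group(1).strip() if match else response_text.strip()
```

The fallback to the raw text means the pipeline still works on the rare response where the model skips the tags.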
Step 4: RAG for Large Database Schemas
When your database has hundreds of tables, you can't fit the entire schema in a prompt. Retrieval Augmented Generation (RAG) solves this by dynamically fetching only the relevant schema information.
Here's how to implement it with VoyageAI embeddings:
```python
import voyageai
import numpy as np

class SchemaVectorDB:
    def __init__(self, conn):
        self.vo = voyageai.Client()
        self.tables = self._extract_tables(conn)
        self.embeddings = self._create_embeddings()

    def _extract_tables(self, conn):
        cursor = conn.cursor()
        cursor.execute("SELECT name FROM sqlite_master WHERE type='table'")
        tables = []
        for (table_name,) in cursor.fetchall():
            cursor.execute(f"PRAGMA table_info({table_name})")
            columns = [row[1] for row in cursor.fetchall()]
            tables.append({"name": table_name, "columns": columns})
        return tables

    def _create_embeddings(self):
        texts = [
            f"Table {table['name']} has columns: {', '.join(table['columns'])}"
            for table in self.tables
        ]
        result = self.vo.embed(texts, model="voyage-2")
        return result.embeddings

    def query(self, user_query, top_k=3):
        query_embedding = self.vo.embed([user_query], model="voyage-2").embeddings[0]
        # Voyage embeddings are normalized, so a dot product gives cosine similarity
        similarities = [np.dot(query_embedding, emb) for emb in self.embeddings]
        top_indices = np.argsort(similarities)[-top_k:][::-1]
        return [self.tables[i] for i in top_indices]

# Usage
vector_db = SchemaVectorDB(conn)
relevant_tables = vector_db.query("Show me employee salaries")
# Returns the tables most related to employees and salaries
```
Then inject only the relevant schema into your prompt, keeping it concise and focused.
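A small formatting helper closes the loop between retrieval and prompting. This is a sketch (`format_retrieved_schema` is a hypothetical name); it takes the table dicts returned by `SchemaVectorDB.query` and produces a compact schema block you can pass to `generate_sql` in place of the full schema:

```python
def format_retrieved_schema(tables):
    """Turn retrieved table dicts into a compact schema block for the prompt."""
    lines = []
    for table in tables:
        lines.append(f"Table {table['name']}: {', '.join(table['columns'])}")
    return "\n".join(lines)

# Demo with the shape of dict that SchemaVectorDB.query returns
schema_snippet = format_retrieved_schema([
    {"name": "employees", "columns": ["id", "name", "department_id", "salary", "hire_date"]},
])
```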
Step 5: Self-Improvement Loop
One of Claude's most powerful capabilities is self-correction. You can build a loop where Claude generates SQL, executes it, checks for errors, and fixes them:
```python
def text_to_sql_with_self_improvement(user_query, schema, conn, max_attempts=3):
    for attempt in range(max_attempts):
        # Generate SQL
        sql = generate_sql(user_query, schema)
        try:
            cursor = conn.cursor()
            cursor.execute(sql)
            results = cursor.fetchall()

            # Ask Claude to sanity-check its own output. Requesting a fixed
            # verdict token avoids false positives: a naive substring check
            # for "correct" would also match "incorrect".
            validation_prompt = f"""
User asked: {user_query}
Generated SQL: {sql}
Results: {results[:5]}

If the results answer the question, respond with exactly VALID.
Otherwise, explain what's wrong."""
            validation = client.messages.create(
                model="claude-3-5-sonnet-20241022",
                max_tokens=300,
                messages=[{"role": "user", "content": validation_prompt}]
            )
            verdict = validation.content[0].text.strip()
            if verdict.upper().startswith("VALID"):
                return sql, results
            # Feed the critique back into the next attempt
            user_query += f"\nPrevious attempt had issues: {verdict}"
        except Exception as e:
            # SQL execution error - feed it back
            user_query += f"\nPrevious SQL caused error: {e}"
    return None, "Failed after max attempts"
```
This loop handles:
- Syntax errors (Claude fixes malformed SQL)
- Logical errors (wrong joins, missing filters)
- Edge cases (empty results that should have data)
Evaluation: Measuring Success
To build a reliable system, you need to evaluate performance. Create a test set of query-SQL pairs and measure:
- Execution accuracy: Does the SQL run without errors?
- Result accuracy: Do the results match expected output?
- Schema coverage: Does the system handle all table types?
```python
test_cases = [
    ("List all departments", "SELECT * FROM departments;"),
    ("Show employees in Engineering", "SELECT * FROM employees WHERE department_id = 1;"),
    ("Average salary by department",
     "SELECT d.name, AVG(e.salary) FROM employees e JOIN departments d ON e.department_id = d.id GROUP BY d.name;")
]

def evaluate(test_cases, schema):
    correct = 0
    for query, expected_sql in test_cases:
        generated_sql = generate_sql(query, schema)
        # Naive check: exact string comparison penalizes queries that are
        # equivalent but formatted differently
        if generated_sql.strip().lower() == expected_sql.strip().lower():
            correct += 1
    return correct / len(test_cases)
```
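String comparison undercounts: two differently written queries can return identical data. A more forgiving check is to execute both the generated and expected SQL and compare the result sets. A sketch (`results_match` is a hypothetical helper; the demo uses a throwaway database):

```python
import sqlite3

def results_match(conn, generated_sql, expected_sql):
    """Execute both queries and compare their rows, ignoring row order."""
    try:
        generated = conn.execute(generated_sql).fetchall()
        expected = conn.execute(expected_sql).fetchall()
    except sqlite3.Error:
        # Generated SQL that fails to run counts as a miss
        return False
    return sorted(map(repr, generated)) == sorted(map(repr, expected))

# Demo on a throwaway database
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE t (x INTEGER)")
demo.executemany("INSERT INTO t VALUES (?)", [(1,), (2,)])
same = results_match(demo, "SELECT x FROM t ORDER BY x", "SELECT x FROM t")
```

Swapping this into `evaluate` in place of the string comparison measures result accuracy rather than surface form. Note it still can't tell apart queries that happen to agree on your sample data but differ in general.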
Key Takeaways
- Start simple, iterate fast: Begin with a basic prompt and add complexity (few-shot, CoT, RAG) only as needed. Over-engineering from the start leads to brittle systems.
- Use chain-of-thought for complex queries: Breaking down multi-step SQL generation into reasoning steps dramatically improves accuracy for joins, subqueries, and aggregations.
- RAG is essential for production databases: When your schema exceeds the context window, use embeddings to retrieve only relevant table definitions. This keeps prompts focused and reduces token costs.
- Self-improvement loops catch errors: Claude can analyze its own SQL output, detect issues, and regenerate corrected queries. This is your safety net for production use.
- Always evaluate: Build a test suite of query-SQL pairs. Measure both execution accuracy (no syntax errors) and result accuracy (correct data returned).