Building a Robust Text-to-SQL System with Claude: A Practical Guide
Imagine giving your non-technical team members the power to query complex databases just by asking questions in plain English. Or picture yourself prototyping complex analytical queries in seconds instead of minutes. That's the promise of Text-to-SQL, and with Claude, it's more accessible than ever.
In this guide, you'll learn how to build a production-ready Text-to-SQL system using Claude. We'll start with the basics and progressively add sophistication: few-shot examples, chain-of-thought reasoning, RAG for large schemas, and a self-improvement loop that lets Claude fix its own mistakes.
Why Text-to-SQL Matters
Text-to-SQL converts natural language questions into structured SQL queries. It's a game-changer for several reasons:
- Accessibility: Non-technical stakeholders can explore data without learning SQL syntax.
- Efficiency: Data analysts can rapidly prototype queries and iterate on ideas.
- Integration: Chatbots and internal tools can offer natural language database interfaces.
- Complexity: LLMs like Claude can generate multi-join queries, subqueries, and aggregations that would take humans significant time to write.
Setting Up Your Environment
First, let's create a test SQLite database with two tables: employees and departments. This will be our sandbox throughout the guide.
```python
import sqlite3

# Create an in-memory database
conn = sqlite3.connect(':memory:')
cursor = conn.cursor()

# Create departments table
cursor.execute('''
CREATE TABLE departments (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    location TEXT
)
''')

# Create employees table
cursor.execute('''
CREATE TABLE employees (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    department_id INTEGER,
    salary REAL,
    hire_date DATE,
    FOREIGN KEY (department_id) REFERENCES departments(id)
)
''')

# Insert sample data
cursor.executemany('INSERT INTO departments VALUES (?, ?, ?)', [
    (1, 'Engineering', 'New York'),
    (2, 'Marketing', 'San Francisco'),
    (3, 'Sales', 'Chicago')
])

cursor.executemany('INSERT INTO employees VALUES (?, ?, ?, ?, ?)', [
    (1, 'Alice Johnson', 1, 95000, '2020-03-15'),
    (2, 'Bob Smith', 2, 72000, '2019-07-01'),
    (3, 'Charlie Brown', 1, 88000, '2021-01-10'),
    (4, 'Diana Prince', 3, 65000, '2022-06-20')
])

conn.commit()
```
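Before moving on, it's worth sanity-checking the setup with a quick join. The snippet below recreates the same database in condensed form (so it runs standalone) and counts employees per department:

```python
import sqlite3

# Recreate the sample database from above in condensed form
conn = sqlite3.connect(':memory:')
cur = conn.cursor()
cur.execute("CREATE TABLE departments (id INTEGER PRIMARY KEY, name TEXT NOT NULL, location TEXT)")
cur.execute("""CREATE TABLE employees (
    id INTEGER PRIMARY KEY, name TEXT NOT NULL, department_id INTEGER,
    salary REAL, hire_date DATE,
    FOREIGN KEY (department_id) REFERENCES departments(id))""")
cur.executemany("INSERT INTO departments VALUES (?, ?, ?)",
                [(1, 'Engineering', 'New York'),
                 (2, 'Marketing', 'San Francisco'),
                 (3, 'Sales', 'Chicago')])
cur.executemany("INSERT INTO employees VALUES (?, ?, ?, ?, ?)",
                [(1, 'Alice Johnson', 1, 95000, '2020-03-15'),
                 (2, 'Bob Smith', 2, 72000, '2019-07-01'),
                 (3, 'Charlie Brown', 1, 88000, '2021-01-10'),
                 (4, 'Diana Prince', 3, 65000, '2022-06-20')])
conn.commit()

# Headcount per department -- verifies the join path works
cur.execute("""SELECT d.name, COUNT(*) AS headcount
               FROM employees e JOIN departments d ON e.department_id = d.id
               GROUP BY d.name ORDER BY d.name""")
counts = cur.fetchall()
print(counts)  # [('Engineering', 2), ('Marketing', 1), ('Sales', 1)]
```

If the counts look right, the foreign-key relationship is wired up and we can start generating queries against it.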
Creating a Basic Text-to-SQL Prompt
The foundation of any Text-to-SQL system is a well-structured prompt. A good prompt includes:
- Clear instructions for the task
- The user's natural language query
- The database schema so Claude knows the available tables and columns
```python
def get_schema_string(conn):
    """Return the CREATE TABLE statements for every table in the database."""
    cursor = conn.cursor()
    cursor.execute("SELECT sql FROM sqlite_master WHERE type='table'")
    schemas = cursor.fetchall()
    return '\n'.join([s[0] for s in schemas])

def generate_basic_prompt(user_query, schema):
    return f"""You are a SQL expert. Convert the following natural language query into a SQLite SQL query.

Database Schema:
{schema}

User Query: {user_query}

Generate only the SQL query, no explanation."""

# Example usage
schema = get_schema_string(conn)
prompt = generate_basic_prompt("Show me all employees in Engineering", schema)
print(prompt)
```
Now let's call Claude's API to generate the SQL:
```python
import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=200,
    messages=[{"role": "user", "content": prompt}]
)

generated_sql = response.content[0].text
print("Generated SQL:")
print(generated_sql)
```
Let's test the generated SQL against our database:
```python
cursor.execute(generated_sql)
results = cursor.fetchall()
print("Results:", results)
```
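In practice, Claude sometimes wraps its answer in a markdown fence even when told not to. A small defensive helper (the `clean_sql` name and regex here are illustrative, not part of any SDK) makes execution more robust:

```python
import re

def clean_sql(raw: str) -> str:
    """Strip a surrounding markdown code fence, if present, from a model response."""
    text = raw.strip()
    # Match an optional ```sql / ``` fence wrapping the whole response
    match = re.match(r"^```(?:sql)?\s*(.*?)\s*```$", text, re.DOTALL | re.IGNORECASE)
    if match:
        return match.group(1).strip()
    return text

print(clean_sql("```sql\nSELECT * FROM employees;\n```"))  # SELECT * FROM employees;
```

Running the cleaned string through `cursor.execute` avoids spurious syntax errors caused by formatting rather than by the SQL itself.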
Improving the Prompt with Examples (Few-Shot Learning)
A basic prompt works, but adding examples—a technique called few-shot learning—dramatically improves accuracy. Examples show Claude the expected format and reasoning pattern.
```python
def generate_few_shot_prompt(user_query, schema):
    examples = """
Example 1:
User Query: List all departments
SQL: SELECT * FROM departments;

Example 2:
User Query: Find employees hired after 2020
SQL: SELECT * FROM employees WHERE hire_date > '2020-12-31';

Example 3:
User Query: Show average salary by department
SQL: SELECT d.name, AVG(e.salary) FROM employees e JOIN departments d ON e.department_id = d.id GROUP BY d.name;
"""
    return f"""You are a SQL expert. Convert natural language queries into SQLite SQL.

Database Schema:
{schema}

Here are some examples:
{examples}

Now convert this query:
User Query: {user_query}
SQL:"""
```
Pro tip: Consider including a few rows of sample data in your prompt. This gives Claude additional context about the actual values and data patterns, which can improve accuracy for queries involving filters or aggregations.
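One way to act on that tip is a helper that pulls the first few rows of every table into a string you can splice into the prompt. This is a sketch; the `get_sample_rows` name and the row limit are arbitrary choices, and the table names interpolated into the query come from `sqlite_master`, not from user input:

```python
import sqlite3

def get_sample_rows(conn, limit=3):
    """Return up to `limit` rows from each table, formatted for prompt context."""
    cursor = conn.cursor()
    cursor.execute("SELECT name FROM sqlite_master WHERE type='table'")
    tables = [row[0] for row in cursor.fetchall()]
    parts = []
    for table in tables:
        # Table names are read from sqlite_master, so interpolation is safe here
        cursor.execute(f"SELECT * FROM {table} LIMIT {int(limit)}")
        cols = [d[0] for d in cursor.description]
        parts.append(f"-- {table} ({', '.join(cols)})")
        parts.extend(str(row) for row in cursor.fetchall())
    return "\n".join(parts)

# Demo on a tiny throwaway database
conn = sqlite3.connect(':memory:')
conn.execute("CREATE TABLE departments (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO departments VALUES (1, 'Engineering')")
sample = get_sample_rows(conn)
print(sample)
```

Appending the returned string under a "Sample Data:" heading in the prompt gives Claude concrete values to match against when writing filters.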
Using Chain-of-Thought Prompting
For complex queries involving multiple joins, subqueries, or conditional logic, chain-of-thought (CoT) prompting helps Claude reason step-by-step before generating the final SQL.
```python
def generate_cot_prompt(user_query, schema):
    return f"""You are a SQL expert. Convert the following query into SQLite SQL.

Database Schema:
{schema}

User Query: {user_query}

Let's approach this step-by-step:
<reasoning>
- Identify the tables needed
- Determine the columns to select
- Identify join conditions
- Apply filters and aggregations
- Consider ordering and limits
</reasoning>

Now, based on your reasoning, generate the SQL query:
<sql>
"""
```
The XML tags (<reasoning> and <sql>) help structure Claude's output, making it easier to parse the final SQL programmatically.
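One way to do that parsing is shown below. The `extract_sql` helper is a hypothetical sketch: it prefers an explicit `<sql>...</sql>` block if Claude emitted both tags, and otherwise takes whatever follows the last `<sql>` tag, since the prompt already ends with one:

```python
import re

def extract_sql(completion: str) -> str:
    """Pull the SQL out of a completion that may contain <reasoning>/<sql> tags."""
    # Prefer a complete <sql>...</sql> block if the model emitted both tags
    match = re.search(r"<sql>\s*(.*?)\s*</sql>", completion, re.DOTALL)
    if match:
        return match.group(1).strip()
    # Otherwise take the text after the last <sql> tag, dropping any stray close tag
    return completion.split("<sql>")[-1].split("</sql>")[0].strip()

print(extract_sql("<reasoning>join needed</reasoning>\n<sql>\nSELECT 1;\n</sql>"))  # SELECT 1;
```

Handling both shapes matters because models don't always reproduce the closing tag you expect.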
Implementing RAG for Complex Database Schemas
Real-world databases can have hundreds of tables. Including the entire schema in every prompt is wasteful and can confuse the model. Retrieval Augmented Generation (RAG) solves this by dynamically fetching only the relevant schema information.
First, let's build a simple vector search system using VoyageAI embeddings:
```python
import voyageai
import numpy as np
from typing import List, Dict

class SchemaVectorDB:
    def __init__(self, api_key: str):
        self.client = voyageai.Client(api_key=api_key)
        self.embeddings = []
        self.metadata = []

    def add_table(self, table_name: str, columns: List[Dict], description: str = ""):
        # Build a textual description of the table, then embed it
        text = (
            f"Table: {table_name}. "
            f"Columns: {', '.join([c['name'] + ' (' + c['type'] + ')' for c in columns])}. "
            f"Description: {description}"
        )
        embedding = self.client.embed([text], model="voyage-2").embeddings[0]
        self.embeddings.append(embedding)
        self.metadata.append({"table": table_name, "columns": columns, "text": text})

    def query(self, user_query: str, top_k: int = 3) -> List[str]:
        # Rank stored tables by dot-product similarity to the query embedding
        query_embedding = self.client.embed([user_query], model="voyage-2").embeddings[0]
        similarities = [np.dot(query_embedding, emb) for emb in self.embeddings]
        top_indices = np.argsort(similarities)[-top_k:][::-1]
        return [self.metadata[i]["text"] for i in top_indices]

# Initialize and populate the vector DB
schema_db = SchemaVectorDB(api_key="your-voyage-api-key")

schema_db.add_table("employees", [
    {"name": "id", "type": "INTEGER"},
    {"name": "name", "type": "TEXT"},
    {"name": "department_id", "type": "INTEGER"},
    {"name": "salary", "type": "REAL"},
    {"name": "hire_date", "type": "DATE"}
], description="Employee records")

schema_db.add_table("departments", [
    {"name": "id", "type": "INTEGER"},
    {"name": "name", "type": "TEXT"},
    {"name": "location", "type": "TEXT"}
], description="Department information")
```
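The ranking step inside `query` is independent of the embedding provider, which makes it easy to unit-test without API calls. The sketch below reproduces the same top-k dot-product logic with hand-made 3-dimensional vectors (the vector values are made up purely for illustration):

```python
import numpy as np

def top_k_by_dot(query_vec, doc_vecs, k=2):
    """Rank document vectors by dot-product similarity to the query vector."""
    sims = [float(np.dot(query_vec, v)) for v in doc_vecs]
    return [int(i) for i in np.argsort(sims)[-k:][::-1]]

# Toy "embeddings": doc 0 is nearly parallel to the query, doc 2 is orthogonal
docs = [np.array([1.0, 0.0, 0.0]),
        np.array([0.7, 0.7, 0.0]),
        np.array([0.0, 0.0, 1.0])]
query = np.array([1.0, 0.1, 0.0])
ranked = top_k_by_dot(query, docs, k=2)
print(ranked)  # [0, 1]
```

Note that a raw dot product only equals cosine similarity when the embeddings are unit-normalized; if your provider doesn't guarantee that, normalize before comparing.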
Now update your prompt function to use RAG:
```python
def generate_rag_prompt(user_query: str, schema_db: SchemaVectorDB):
    relevant_schemas = schema_db.query(user_query, top_k=2)
    schema_context = "\n".join(relevant_schemas)
    return f"""You are a SQL expert. Use the following relevant schema information to answer the query.

Relevant Schema:
{schema_context}

User Query: {user_query}

Generate the SQL query:"""
```
Implementing Query Self-Improvement
Even with great prompts, Claude might generate SQL that has syntax errors or returns unexpected results. A self-improvement loop lets Claude execute its generated SQL, analyze the output, and fix issues automatically.
```python
def self_improving_query(user_query: str, schema: str, conn, max_iterations: int = 3):
    prompt = generate_few_shot_prompt(user_query, schema)

    for i in range(max_iterations):
        response = client.messages.create(
            model="claude-3-5-sonnet-20241022",
            max_tokens=500,
            messages=[{"role": "user", "content": prompt}]
        )
        sql = response.content[0].text.strip()

        try:
            cursor = conn.cursor()
            cursor.execute(sql)
            results = cursor.fetchall()
            # If we get here, the SQL executed successfully
            return {"sql": sql, "results": results, "iterations": i + 1}
        except Exception as e:
            error_msg = str(e)
            print(f"Attempt {i+1} failed: {error_msg}")
            # Feed both the failing SQL and the error back to Claude,
            # so it can see what it needs to correct
            prompt = f"""The previous SQL query failed with this error: {error_msg}

Failed SQL:
{sql}

Original query: {user_query}

Database Schema:
{schema}

Please generate a corrected SQL query:"""

    return {"sql": None, "results": None, "error": "Max iterations reached"}
```
This self-improvement loop is especially valuable in production systems where you can't manually review every generated query. It handles:
- Syntax errors: Misspelled keywords, missing commas, etc.
- Schema mismatches: Referencing non-existent columns or tables.
- Logic errors: Queries that run but return empty or incorrect results. Execution alone won't flag these, so the loop needs an extra validation check on the output before accepting a query as correct.
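For that last category, one hedged approach is to treat suspicious output, such as an empty result set, as a soft failure and feed it back just like an exception. The `validate_result` helper below is an illustrative heuristic, not a complete solution (an empty result is sometimes the genuinely correct answer):

```python
def validate_result(sql: str, results: list) -> tuple:
    """Heuristic check for logic errors in a query that executed without error.

    Returns (ok, feedback); feedback is suitable for a retry prompt.
    """
    if not results:
        return (False,
                "The query executed but returned no rows. "
                "Double-check join conditions and filter values.")
    return (True, "")

ok, feedback = validate_result("SELECT * FROM employees WHERE salary > 1e9", [])
print(ok)  # False
```

Wiring this into the loop means a query only counts as a success when it both executes and passes validation; otherwise the feedback string is appended to the retry prompt.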
Evaluating Your Text-to-SQL System
To ensure your system is reliable, you need a robust evaluation framework. Here's a simple approach:
- Create a test dataset: Prepare a set of natural language queries paired with the expected SQL and expected results.
- Run automated tests: For each test case, generate SQL, execute it, and compare results.
- Track metrics: Measure accuracy (exact SQL string match, which is brittle since semantically equivalent queries can differ textually), execution success rate, and result correctness.
```python
def evaluate_system(test_cases, conn, schema):
    results = []
    for case in test_cases:
        output = self_improving_query(case["query"], schema, conn)
        sql_match = output["sql"] == case["expected_sql"]
        result_match = output["results"] == case["expected_results"]
        results.append({
            "query": case["query"],
            "sql_match": sql_match,
            "result_match": result_match,
            "generated_sql": output["sql"]
        })

    accuracy = sum(r["result_match"] for r in results) / len(results)
    print(f"Overall accuracy: {accuracy:.2%}")
    return results
```
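A test case is just a dict pairing a question with its expected SQL and expected rows. Here is one example drawn from the sample database built earlier (note that SQLite returns the REAL salary column as a float):

```python
# One illustrative test case for the sample employees/departments database
test_cases = [
    {
        "query": "Show me all employees in Engineering",
        "expected_sql": ("SELECT e.* FROM employees e "
                         "JOIN departments d ON e.department_id = d.id "
                         "WHERE d.name = 'Engineering';"),
        "expected_results": [
            (1, 'Alice Johnson', 1, 95000.0, '2020-03-15'),
            (3, 'Charlie Brown', 1, 88000.0, '2021-01-10'),
        ],
    },
]
print(len(test_cases))  # 1
```

Since exact SQL matching penalizes equivalent queries, weigh `result_match` more heavily than `sql_match` when deciding whether an iteration on your prompts was an improvement.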
Key Takeaways
- Start simple, then layer complexity: Begin with a basic prompt, then add few-shot examples, chain-of-thought reasoning, and RAG as your schema grows.
- RAG is essential for production systems: When your database has dozens or hundreds of tables, dynamically retrieving relevant schema information keeps prompts focused and accurate.
- Self-improvement loops reduce manual oversight: Let Claude execute, analyze, and fix its own SQL queries. This dramatically improves reliability in production.
- Include sample data in prompts: A few rows of real data help Claude understand data patterns and write more accurate queries.
- Evaluate systematically: Build a test suite with expected SQL and results to track improvements and catch regressions as you iterate on your system.