Building a Robust Text-to-SQL System with Claude: A Practical Guide
Text-to-SQL is one of the most practical applications of large language models in the enterprise. It allows non-technical users to query databases using natural language, and helps data analysts prototype queries faster. Claude excels at this task because of its strong reasoning capabilities, large context window, and ability to follow structured instructions.
In this guide, you'll learn how to build a production-ready Text-to-SQL system using Claude. We'll start with a basic prompt, then progressively improve it with few-shot examples, chain-of-thought reasoning, Retrieval Augmented Generation (RAG) for large schemas, and a self-improvement loop that lets Claude fix its own mistakes.
Why Text-to-SQL Matters
Before diving into the code, let's understand why this capability is so valuable:
- Accessibility: Business analysts, product managers, and executives can query databases without knowing SQL syntax.
- Efficiency: Data professionals can quickly prototype complex queries involving joins, subqueries, and aggregations.
- Integration: Text-to-SQL enables natural language interfaces for chatbots, dashboards, and internal tools.
- Accuracy: With proper prompting and validation, Claude can generate syntactically correct SQL that handles edge cases.
Setting Up Your Environment
First, let's set up a test SQLite database with two tables: employees and departments. This will be our sandbox for the entire guide.
```python
import sqlite3

# Create an in-memory SQLite database
conn = sqlite3.connect(':memory:')
cursor = conn.cursor()

# Create departments table
cursor.execute('''
CREATE TABLE departments (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    location TEXT
)
''')

# Create employees table
cursor.execute('''
CREATE TABLE employees (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    department_id INTEGER,
    salary REAL,
    hire_date DATE,
    FOREIGN KEY (department_id) REFERENCES departments(id)
)
''')

# Insert sample data
cursor.executemany('INSERT INTO departments VALUES (?, ?, ?)', [
    (1, 'Engineering', 'New York'),
    (2, 'Marketing', 'San Francisco'),
    (3, 'Sales', 'Chicago')
])

cursor.executemany('INSERT INTO employees VALUES (?, ?, ?, ?, ?)', [
    (1, 'Alice', 1, 95000, '2020-01-15'),
    (2, 'Bob', 1, 85000, '2021-03-20'),
    (3, 'Charlie', 2, 75000, '2019-11-01'),
    (4, 'Diana', 3, 80000, '2022-06-10'),
    (5, 'Eve', 2, 72000, '2023-02-28')
])

conn.commit()
```
Step 1: The Basic Text-to-SQL Prompt
A good Text-to-SQL prompt needs three things: clear instructions, the database schema, and the user's natural language query. Here's a minimal implementation:
```python
import anthropic

client = anthropic.Anthropic()

def get_schema():
    """Extract the CREATE TABLE statements from the SQLite database."""
    cursor.execute("SELECT sql FROM sqlite_master WHERE type='table'")
    schemas = cursor.fetchall()
    return '\n'.join([s[0] for s in schemas if s[0]])

def basic_text_to_sql(user_query):
    schema = get_schema()
    prompt = f"""You are a SQL expert. Given the following database schema and a user query, generate a valid SQL query.

Database Schema:
{schema}

User Query: {user_query}

Generate only the SQL query, no explanation."""
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=500,
        messages=[{"role": "user", "content": prompt}]
    )
    return response.content[0].text

# Test it
query = "Show me all employees in the Engineering department"
sql = basic_text_to_sql(query)
print(f"Generated SQL: {sql}")

# Execute and verify
cursor.execute(sql)
results = cursor.fetchall()
print(f"Results: {results}")
```
This works, but it's fragile. For complex queries, the model might misinterpret column names or generate incorrect joins.
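One practical wrinkle: even when instructed to return only SQL, models sometimes wrap the query in markdown fences or prefix it with a label, which breaks `cursor.execute()`. A small helper can normalize the output before execution (`extract_sql` is a name introduced here for illustration, not part of any SDK):

```python
import re

def extract_sql(text: str) -> str:
    """Strip markdown fences and labels that a model may wrap around SQL."""
    text = text.strip()
    # Pull the contents of a fenced block like ```sql ... ``` if present
    match = re.search(r"```(?:sql)?\s*(.*?)```", text, re.DOTALL | re.IGNORECASE)
    if match:
        text = match.group(1)
    # Drop a leading "SQL:" label if present
    text = re.sub(r"^\s*SQL:\s*", "", text, flags=re.IGNORECASE)
    return text.strip()
```

Applying this to every model response before execution makes the downstream steps noticeably less brittle.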
Step 2: Improving with Few-Shot Examples
Including examples in your prompt (few-shot learning) dramatically improves accuracy. Let's modify our function:
```python
def few_shot_text_to_sql(user_query):
    schema = get_schema()
    examples = """
Examples:

User: List all departments
SQL: SELECT * FROM departments;

User: Show employee names and their department names
SQL: SELECT e.name, d.name FROM employees e JOIN departments d ON e.department_id = d.id;

User: What is the average salary by department?
SQL: SELECT d.name, AVG(e.salary) FROM employees e JOIN departments d ON e.department_id = d.id GROUP BY d.name;
"""
    prompt = f"""You are a SQL expert. Given the database schema and examples, generate a SQL query.

Database Schema:
{schema}

{examples}

User Query: {user_query}

Generate only the SQL query:"""
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=500,
        messages=[{"role": "user", "content": prompt}]
    )
    return response.content[0].text
```
Pro tip: Include a few rows of actual data in your prompt. This gives Claude concrete context about values and formats, which helps with queries involving dates, strings, or numeric comparisons.
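One way to act on this tip is a small helper that pulls a few rows per table and formats them for the prompt. This is a sketch (`sample_rows` is a hypothetical helper, and it interpolates the table name directly into the SQL string, so only pass trusted table names):

```python
import sqlite3

def sample_rows(cursor, table: str, n: int = 3) -> str:
    """Format a few rows from a table for inclusion in the prompt.
    Caution: `table` is interpolated directly; only use trusted names."""
    cursor.execute(f"SELECT * FROM {table} LIMIT {n}")
    cols = [d[0] for d in cursor.description]
    rows = cursor.fetchall()
    lines = [f"Sample rows from {table} ({', '.join(cols)}):"]
    lines += [str(row) for row in rows]
    return "\n".join(lines)

# Demo with a tiny in-memory table
demo = sqlite3.connect(":memory:")
cur = demo.cursor()
cur.execute("CREATE TABLE employees (id INTEGER, name TEXT)")
cur.executemany("INSERT INTO employees VALUES (?, ?)", [(1, "Alice"), (2, "Bob")])
snippet = sample_rows(cur, "employees", n=2)
```

Appending a snippet like this per table, right after the schema in the prompt, is usually enough for the model to infer date formats and value conventions.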
Step 3: Chain-of-Thought Prompting for Complex Queries
For multi-step queries, chain-of-thought (CoT) prompting helps Claude reason through the problem. Use XML tags to structure the reasoning:
```python
def cot_text_to_sql(user_query):
    schema = get_schema()
    prompt = f"""You are a SQL expert. For complex queries, break down your reasoning step by step.

Database Schema:
{schema}

User Query: {user_query}

<reasoning>
- Identify the tables needed
- Determine the columns to select
- Identify join conditions
- Apply filters and aggregations
- Consider ordering and limits
</reasoning>

<sql>
-- Your SQL query here
</sql>"""
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=800,
        messages=[{"role": "user", "content": prompt}]
    )
    # Extract SQL from the <sql> XML tags
    content = response.content[0].text
    sql_start = content.find('<sql>')
    sql_end = content.find('</sql>')
    if sql_start != -1 and sql_end != -1:
        return content[sql_start + len('<sql>'):sql_end].strip()
    return content
```
This approach is especially useful for queries involving multiple joins, subqueries, or window functions.
Step 4: RAG for Large Database Schemas
When your database has dozens or hundreds of tables, you can't fit the entire schema in a prompt. Retrieval Augmented Generation (RAG) solves this by dynamically fetching only the relevant schema information.
Here's how to implement it using VoyageAI embeddings:
```python
import numpy as np
import voyageai
from typing import List, Dict

class SchemaRetriever:
    def __init__(self, api_key: str):
        self.client = voyageai.Client(api_key=api_key)
        self.table_descriptions = []
        self.embeddings = []

    def index_schema(self, tables: List[Dict[str, str]]):
        """Index table schemas with descriptions"""
        self.table_descriptions = tables
        texts = [t['description'] for t in tables]
        self.embeddings = self.client.embed(texts, model="voyage-2").embeddings

    def retrieve(self, query: str, top_k: int = 3) -> List[Dict]:
        """Retrieve the most relevant tables for a query"""
        query_embedding = self.client.embed([query], model="voyage-2").embeddings[0]
        # Cosine similarity between the query and each table description
        similarities = [
            np.dot(query_embedding, emb) / (np.linalg.norm(query_embedding) * np.linalg.norm(emb))
            for emb in self.embeddings
        ]
        top_indices = np.argsort(similarities)[-top_k:][::-1]
        return [self.table_descriptions[i] for i in top_indices]

# Usage
retriever = SchemaRetriever(api_key="your-voyage-api-key")
retriever.index_schema([
    {"name": "employees", "description": "Employee records with id, name, department_id, salary, hire_date"},
    {"name": "departments", "description": "Department info with id, name, location"},
    {"name": "projects", "description": "Project assignments linking employees to projects"}
])

# For a query about salaries, only the two most relevant tables
# (employees and departments) are retrieved
relevant_tables = retriever.retrieve("average salary by department", top_k=2)
```
Then, instead of passing the entire schema, you pass only the retrieved tables in your prompt.
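As a sketch of that assembly step, a helper (hypothetical, matching the `SchemaRetriever` output format above) can format only the retrieved tables into the prompt:

```python
def build_rag_prompt(user_query: str, tables: list) -> str:
    """Assemble a Text-to-SQL prompt from only the retrieved tables.
    `tables` is a list of dicts with 'name' and 'description' keys,
    matching the SchemaRetriever output above."""
    schema_snippets = "\n".join(
        f"- {t['name']}: {t['description']}" for t in tables
    )
    return (
        "You are a SQL expert. Use only the tables listed below.\n\n"
        f"Relevant tables:\n{schema_snippets}\n\n"
        f"User Query: {user_query}\n\n"
        "Generate only the SQL query."
    )
```

The resulting prompt stays small no matter how many tables the database has, which keeps both latency and cost down.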
Step 5: Self-Improvement Loop
The most powerful technique is letting Claude execute its own SQL, analyze the results, and fix errors. This creates a self-correcting loop:
```python
def self_improving_text_to_sql(user_query, max_attempts=3):
    schema = get_schema()
    # Build the initial prompt once; failed attempts replace it with
    # error feedback rather than starting over from scratch.
    prompt = f"""Generate a SQL query for: {user_query}

Schema: {schema}

SQL:"""
    for attempt in range(max_attempts):
        response = client.messages.create(
            model="claude-3-5-sonnet-20241022",
            max_tokens=500,
            messages=[{"role": "user", "content": prompt}]
        )
        sql = response.content[0].text.strip()
        # Try to execute
        try:
            cursor.execute(sql)
            results = cursor.fetchall()
            # Ask Claude to verify the results
            verification_prompt = f"""The query '{sql}' returned: {results}

Does this correctly answer: {user_query}?

If yes, say 'CORRECT'. If no, explain what's wrong."""
            verify_response = client.messages.create(
                model="claude-3-5-sonnet-20241022",
                max_tokens=300,
                messages=[{"role": "user", "content": verification_prompt}]
            )
            if "CORRECT" in verify_response.content[0].text:
                return sql, results
            # Feed the critique back to Claude for the next attempt
            error_msg = verify_response.content[0].text
            prompt = f"""Previous attempt failed. Error: {error_msg}

Fix the SQL query for: {user_query}

Schema: {schema}

SQL:"""
        except Exception as e:
            # SQL execution error - feed it back
            prompt = f"""Previous SQL '{sql}' caused error: {str(e)}

Fix the SQL query for: {user_query}

Schema: {schema}

SQL:"""
    return None, "Failed after max attempts"
```
This loop handles:
- Syntax errors: Claude can fix typos, missing commas, or incorrect keywords.
- Logic errors: If results don't match the query intent, Claude can adjust joins, filters, or aggregations.
- Schema misunderstandings: If Claude joins the wrong tables, the error feedback helps it correct course.
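One caveat: the loop executes model-generated SQL directly. A rough read-only gate can reduce the blast radius before execution. This is a heuristic sketch, not a security boundary; database-level permissions and a sandboxed connection are the proper controls:

```python
def is_read_only(sql: str) -> bool:
    """Rough guard: allow only a single SELECT (or WITH ... SELECT)
    statement before executing model-generated SQL. A heuristic only,
    not a substitute for database-level permissions."""
    stripped = sql.strip().rstrip(";")
    if ";" in stripped:  # reject multi-statement input
        return False
    if not stripped:
        return False
    first_word = stripped.split(None, 1)[0].upper()
    return first_word in ("SELECT", "WITH")
```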
Evaluation: Measuring Your System's Performance
To build confidence in your Text-to-SQL system, you need a robust evaluation framework. Here's a simple approach:
```python
def evaluate_text_to_sql(test_cases):
    """
    test_cases: list of dicts with 'query', 'expected_sql', and 'expected_results'
    """
    results = []
    for case in test_cases:
        generated_sql, _ = self_improving_text_to_sql(case['query'])
        # Execute generated SQL
        try:
            cursor.execute(generated_sql)
            actual_results = cursor.fetchall()
        except Exception:
            actual_results = None
        # Execute expected SQL
        try:
            cursor.execute(case['expected_sql'])
            expected_results = cursor.fetchall()
        except Exception:
            expected_results = None
        # Compare (two failures must count as a miss, not a match)
        is_correct = actual_results is not None and actual_results == expected_results
        results.append({
            'query': case['query'],
            'generated_sql': generated_sql,
            'is_correct': is_correct
        })
    accuracy = sum(r['is_correct'] for r in results) / len(results)
    return accuracy, results
```
For production systems, consider:
- Execution accuracy: Does the SQL run without errors?
- Result accuracy: Do the results match expected output?
- Semantic accuracy: Does the SQL capture the user's intent, even if the exact output differs?
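One subtlety in result accuracy: comparing `fetchall()` output with `==` is order-sensitive, but SQL row order is undefined without an ORDER BY. A helper that compares rows as multisets by default is often fairer (`results_match` is a name introduced here, not a library function):

```python
from collections import Counter

def results_match(actual, expected, ordered: bool = False) -> bool:
    """Compare two result sets. Unless the query specifies an ORDER BY,
    row order is not guaranteed, so compare as multisets by default."""
    if actual is None or expected is None:
        return False
    if ordered:
        return list(actual) == list(expected)
    return Counter(map(tuple, actual)) == Counter(map(tuple, expected))
```

Swapping this in for the `==` comparison in `evaluate_text_to_sql` avoids penalizing queries that are correct except for row ordering.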
Key Takeaways
- Start simple, iterate fast: Begin with a basic prompt, then add few-shot examples, chain-of-thought reasoning, and finally RAG and self-improvement loops as complexity grows.
- Include schema and data context: Always provide the database schema in your prompt. Including sample rows of data significantly improves accuracy for queries involving specific values.
- Use chain-of-thought for complex queries: Breaking down multi-step SQL generation into reasoning steps helps Claude handle joins, subqueries, and aggregations correctly.
- Implement RAG for large schemas: When your database has many tables, use embeddings to retrieve only the relevant schema information for each query.
- Build a self-improvement loop: Let Claude execute its SQL, analyze errors, and fix them. This dramatically improves reliability in production.
- Evaluate rigorously: Measure execution accuracy, result accuracy, and semantic correctness to track improvements and catch regressions.