Guide | Beginner | Best Practices | 2026-05-15

Building Production-Ready RAG Systems with Claude: From Basic to Advanced

Learn to build, evaluate, and optimize Retrieval Augmented Generation (RAG) systems with Claude. Covers basic setup, evaluation metrics, summary indexing, and re-ranking techniques.

Quick Answer

This guide teaches you to build a production-ready RAG system with Claude, covering basic setup with Voyage AI embeddings, creating an evaluation suite with precision/recall/F1 metrics, and advanced optimization techniques like summary indexing and re-ranking that improved end-to-end accuracy from 71% to 81%.

RAG, Claude API, Voyage AI, Vector Search, Evaluation

Retrieval Augmented Generation (RAG) is one of the most powerful patterns for extending Claude's capabilities into your specific business context. While Claude excels at general knowledge tasks, it needs RAG to answer questions about your internal documents, customer support history, or proprietary research.

In this guide, we'll walk through building a complete RAG system using Claude and Voyage AI embeddings, then systematically improve it using evaluation-driven development. We'll cover three levels of sophistication:

Basic RAG - Simple chunking, embedding, and retrieval
Summary Indexing - Adding document summaries for better context
Re-ranking - Using Claude to improve result ordering

By the end, you'll have a production-ready approach that, in our evaluation, raised end-to-end accuracy from 71% to 81%.

Prerequisites and Setup

Before diving in, you'll need:

An Anthropic API key for Claude
A Voyage AI API key for embeddings
Python 3.8+ with basic data science libraries

Installing Dependencies

pip install anthropic voyageai pandas numpy matplotlib scikit-learn

Initializing Your Vector Database

For this guide, we'll use an in-memory vector store. In production, consider managed solutions like Pinecone, Weaviate, or pgvector.

import voyageai
from anthropic import Anthropic
import numpy as np
from typing import List, Dict, Any


class InMemoryVectorDB:
    def __init__(self, voyage_client):
        self.documents = []
        self.embeddings = []
        self.voyage = voyage_client
    
    def add_documents(self, texts: List[str]):
        """Add documents and their embeddings to the store."""
        response = self.voyage.embed(texts, model="voyage-2")
        self.embeddings.extend(response.embeddings)
        self.documents.extend(texts)
    
    def search(self, query: str, k: int = 3) -> List[Dict[str, Any]]:
        """Retrieve top-k documents by cosine similarity."""
        query_embedding = self.voyage.embed([query], model="voyage-2").embeddings[0]
        
        # Compute cosine similarities
        similarities = [
            np.dot(query_embedding, doc_emb) / 
            (np.linalg.norm(query_embedding) * np.linalg.norm(doc_emb))
            for doc_emb in self.embeddings
        ]
        
        # Get top-k indices
        top_indices = np.argsort(similarities)[-k:][::-1]
        
        return [
            {"text": self.documents[i], "score": similarities[i]}
            for i in top_indices
        ]

# Initialize clients
voyage_client = voyageai.Client(api_key="YOUR_VOYAGE_API_KEY")
anthropic_client = Anthropic(api_key="YOUR_ANTHROPIC_API_KEY")
db = InMemoryVectorDB(voyage_client)
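To make the ranking step in `search` concrete, here is a minimal sketch of cosine-similarity top-k selection on hand-made 2-D vectors, using only the standard library. The vectors, labels, and the helper names `cosine_similarity` and `top_k` are invented for illustration; real embeddings have many more dimensions.

```python
import math
from typing import List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query: List[float], docs: List[Tuple[str, List[float]]], k: int) -> List[Tuple[str, float]]:
    """Return the k (label, score) pairs with the highest cosine similarity."""
    scored = [(label, cosine_similarity(query, vec)) for label, vec in docs]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Toy "embeddings" (hypothetical values, not produced by any model)
docs = [
    ("refund policy", [1.0, 0.0]),
    ("shipping times", [0.0, 1.0]),
    ("returns and refunds", [0.9, 0.1]),
]
query = [1.0, 0.1]

results = top_k(query, docs, k=2)
for label, score in results:
    print(f"{label}: {score:.3f}")
```

Cosine similarity depends only on direction, so "returns and refunds" ranks first here: its vector points almost exactly the same way as the query, even though "refund policy" has the larger raw dot product.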

Level 1: Basic RAG Pipeline

Let's start with what's often called "Naive RAG" - a straightforward three-step process:

Chunk documents by heading or section
Embed each chunk using Voyage AI
Retrieve relevant chunks via cosine similarity

Implementing the Basic Pipeline

def chunk_document(text: str, heading_pattern: str = "## ") -> List[str]:
    """Split document by headings for semantic chunks."""
    chunks = []
    current_chunk = []
    
    for line in text.split("\n"):
        if line.startswith(heading_pattern) and current_chunk:
            chunks.append("\n".join(current_chunk))
            current_chunk = [line]
        else:
            current_chunk.append(line)
    
    if current_chunk:
        chunks.append("\n".join(current_chunk))
    
    return chunks


def basic_rag(query: str, db: InMemoryVectorDB, k: int = 3) -> str:
    """Basic RAG: retrieve chunks and generate answer."""
    # Step 1: Retrieve relevant chunks
    results = db.search(query, k=k)
    context = "\n\n".join([r["text"] for r in results])
    
    # Step 2: Generate answer with Claude
    prompt = f"""Based on the following context, answer the question.
Context:
{context}
Question: {query}
Answer:"""
    
    response = anthropic_client.messages.create(
        model="claude-3-sonnet-20240229",
        max_tokens=500,
        messages=[{"role": "user", "content": prompt}]
    )
    
    return response.content[0].text

Building an Evaluation System

"Vibes-based" evaluation won't cut it for production. You need quantitative metrics to measure and improve your RAG system. Let's build a proper evaluation suite.

Creating a Test Dataset

Generate a synthetic evaluation dataset with 100+ samples. Each sample should include:

A question
The correct answer
The relevant document chunks

import json


def create_evaluation_sample(question: str, answer: str, relevant_chunks: List[str]) -> Dict:
    """Bundle a question, its expected answer, and the chunks that support it."""
    return {
        "question": question,
        "expected_answer": answer,
        "relevant_chunks": relevant_chunks
    }


# Load or generate your dataset
with open("evaluation_dataset.json") as f:
    evaluation_data = json.load(f)
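Before running an evaluation, it can help to check that every sample has the three fields listed above. The sketch below is one possible check, not part of the original pipeline; `validate_samples` and the sample records are hypothetical.

```python
from typing import Any, Dict, List

REQUIRED_KEYS = ("question", "expected_answer", "relevant_chunks")


def validate_samples(samples: List[Dict[str, Any]]) -> List[str]:
    """Return human-readable problems; an empty list means the data looks usable."""
    problems = []
    for idx, sample in enumerate(samples):
        for key in REQUIRED_KEYS:
            if key not in sample:
                problems.append(f"sample {idx}: missing '{key}'")
        chunks = sample.get("relevant_chunks")
        if chunks is not None and (not isinstance(chunks, list) or not chunks):
            problems.append(f"sample {idx}: 'relevant_chunks' must be a non-empty list")
    return problems


# Invented records for illustration
samples = [
    {"question": "What is the refund window?", "expected_answer": "30 days",
     "relevant_chunks": ["## Refunds\nRefunds are accepted within 30 days."]},
    {"question": "How long does shipping take?", "expected_answer": "3-5 days",
     "relevant_chunks": []},
    {"question": "Who do I contact?"},
]

problems = validate_samples(samples)
for problem in problems:
    print(problem)
```

If retrieved and relevant chunks are compared by exact string equality, the text in `relevant_chunks` has to match the stored chunk text character for character.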

Key Metrics Explained

#### Retrieval Metrics

Precision measures how many retrieved chunks are actually relevant:

$$\text{Precision} = \frac{|\text{Retrieved} \cap \text{Relevant}|}{|\text{Retrieved}|}$$

High precision means fewer false positives - you're not wasting Claude's context window on irrelevant information.

Recall measures how many relevant chunks you successfully retrieved:

$$\text{Recall} = \frac{|\text{Retrieved} \cap \text{Relevant}|}{|\text{Relevant}|}$$

High recall ensures Claude has all the information it needs.

F1 Score is the harmonic mean of precision and recall:

$$\text{F1} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$

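Mean Reciprocal Rank (MRR) measures how early the first relevant result appears. Here $N$ is the number of queries and $\text{rank}_i$ is the position of the first relevant chunk for query $i$ (a query with no relevant result contributes 0):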

$$\text{MRR} = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{\text{rank}_i}$$
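As a worked example of the four retrieval formulas, the sketch below scores a single query by hand. The chunk IDs (`c2`, `c7`, and so on) and the helper name `retrieval_metrics` are made up for illustration, and the retrieved list is assumed to contain no duplicates.

```python
from typing import Dict, List, Set


def retrieval_metrics(retrieved: List[str], relevant: Set[str]) -> Dict[str, float]:
    """Precision, recall, F1, and reciprocal rank for one ranked result list."""
    retrieved_set = set(retrieved)
    true_positives = len(retrieved_set & relevant)
    precision = true_positives / len(retrieved_set) if retrieved_set else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator > 0 else 0.0

    # Reciprocal rank: 1 / position of the first relevant result, else 0
    reciprocal_rank = 0.0
    for position, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            reciprocal_rank = 1 / position
            break

    return {"precision": precision, "recall": recall, "f1": f1, "rr": reciprocal_rank}


# Three chunks retrieved in rank order; two chunks are actually relevant
retrieved = ["c7", "c2", "c9"]
relevant = {"c2", "c4"}

metrics = retrieval_metrics(retrieved, relevant)
print(" ".join(f"{name}={value:.2f}" for name, value in metrics.items()))
# prints: precision=0.33 recall=0.50 f1=0.40 rr=0.50
```

One of the three retrieved chunks is relevant (precision 1/3), one of the two relevant chunks was found (recall 1/2), and the first hit sits at rank 2 (reciprocal rank 1/2). MRR is the average of the reciprocal rank over all queries.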

#### End-to-End Metric

Accuracy measures whether Claude's final answer is correct given the retrieved context.
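The guide does not show how accuracy was judged, so the following is only a sketch under stated assumptions: `answer_fn` stands in for a RAG pipeline that maps a question to an answer, and `judge_fn` decides whether that answer is correct. Both names are hypothetical. A crude substring check plays the judge here so the example runs offline; in practice a model-graded comparison is a common alternative.

```python
from typing import Callable, Dict, List


def evaluate_end_to_end(
    eval_data: List[Dict],
    answer_fn: Callable[[str], str],
    judge_fn: Callable[[str, str, str], bool],
) -> float:
    """Fraction of samples whose generated answer the judge accepts."""
    if not eval_data:
        return 0.0
    correct = 0
    for sample in eval_data:
        generated = answer_fn(sample["question"])
        if judge_fn(sample["question"], sample["expected_answer"], generated):
            correct += 1
    return correct / len(eval_data)


def substring_judge(question: str, expected: str, generated: str) -> bool:
    """Crude stand-in judge: accept if the expected answer appears in the output."""
    return expected.lower() in generated.lower()


# Stub pipeline with canned answers (invented data)
canned = {
    "What is the refund window?": "Refunds are accepted within 30 days.",
    "How long does shipping take?": "Usually about a week.",
}
eval_data = [
    {"question": "What is the refund window?", "expected_answer": "30 days"},
    {"question": "How long does shipping take?", "expected_answer": "3-5 days"},
]

accuracy = evaluate_end_to_end(eval_data, lambda q: canned[q], substring_judge)
print(f"accuracy = {accuracy:.0%}")
# prints: accuracy = 50%
```

Keeping the judge separate from the pipeline makes it possible to swap in a stricter grader later without touching the evaluation loop.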

Implementing the Evaluation

def evaluate_retrieval(db: InMemoryVectorDB, eval_data: List[Dict], k: int = 3):
    """Evaluate retrieval performance."""
    precisions, recalls, f1s, mrrs = [], [], [], []
    
    for sample in eval_data:
        query = sample["question"]
        relevant = set(sample["relevant_chunks"])
        
        # Retrieve chunks
        results = db.search(query, k=k)
        retrieved = set([r["text"] for r in results])
        
        # Calculate metrics
        true_positives = len(retrieved & relevant)
        
        precision = true_positives / len(retrieved) if retrieved else 0
        recall = true_positives / len(relevant) if relevant else 0
        f1 = 2 * (precision * recall) / (precision + recall) if (precision + recall) > 0 else 0
        
        # MRR: find first relevant result
        mrr = 0
        for i, r in enumerate(results):
            if r["text"] in relevant:
                mrr = 1 / (i + 1)
                break
        
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
        mrrs.append(mrr)
    
    return {
        "avg_precision": np.mean(precisions),
        "avg_recall": np.mean(recalls),
        "avg_f1": np.mean(f1s),
        "avg_mrr": np.mean(mrrs)
    }

Level 2: Summary Indexing

Basic chunking loses the forest for the trees. Summary indexing adds a high-level overview of each document section, improving retrieval for questions that require synthesis.

How Summary Indexing Works

For each document chunk, generate a summary using Claude
Store both the original chunk and its summary
When searching, match against summaries first, then retrieve full chunks

def generate_summary(chunk: str, anthropic_client: Anthropic) -> str:
    """Generate a concise summary of a document chunk."""
    response = anthropic_client.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=100,
        messages=[{
            "role": "user", 
            "content": f"Summarize this in 1-2 sentences:\n\n{chunk}"
        }]
    )
    return response.content[0].text


def build_summary_index(chunks: List[str], anthropic_client: Anthropic, voyage_client) -> InMemoryVectorDB:
    """Build a vector index with summaries."""
    summary_db = InMemoryVectorDB(voyage_client)
    
    for chunk in chunks:
        summary = generate_summary(chunk, anthropic_client)
        # Store summary + chunk for retrieval
        summary_db.add_documents([f"Summary: {summary}\n\nFull: {chunk}"])
    
    return summary_db
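The listing above embeds the summary and the chunk together as one string. A variant that follows the three steps more literally embeds only the summaries and returns the untouched original chunk. Returning the original text also keeps the results comparable with `relevant_chunks` when evaluation relies on exact string matches. The sketch below is one way to do this, not the guide's implementation: `SummaryKeyedIndex`, `toy_embed`, and the sample text are hypothetical, and the embedding function is injected so a real embedding client can replace the toy one.

```python
import math
from typing import Callable, List, Tuple

EmbedFn = Callable[[List[str]], List[List[float]]]


def _cosine(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class SummaryKeyedIndex:
    """Matches queries against summary embeddings but returns original chunks."""

    def __init__(self, embed_fn: EmbedFn):
        self.embed_fn = embed_fn
        self.chunks: List[str] = []
        self.summary_embeddings: List[List[float]] = []

    def add(self, chunks: List[str], summaries: List[str]) -> None:
        if len(chunks) != len(summaries):
            raise ValueError("need exactly one summary per chunk")
        self.summary_embeddings.extend(self.embed_fn(summaries))
        self.chunks.extend(chunks)

    def search(self, query: str, k: int = 3) -> List[Tuple[str, float]]:
        query_vec = self.embed_fn([query])[0]
        scored = [
            (chunk, _cosine(query_vec, vec))
            for chunk, vec in zip(self.chunks, self.summary_embeddings)
        ]
        return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


VOCAB = ["refund", "shipping", "warranty"]


def toy_embed(texts: List[str]) -> List[List[float]]:
    """Hypothetical stand-in for an embedding model: counts of vocabulary words."""
    return [[float(t.lower().count(word)) for word in VOCAB] for t in texts]


chunks = [
    "## Refunds\nItems can be sent back within 30 days.",
    "## Delivery\nOrders arrive in 3-5 business days.",
]
summaries = [
    "Refund policy: 30-day refund window.",
    "Shipping times for standard orders.",
]

index = SummaryKeyedIndex(toy_embed)
index.add(chunks, summaries)
best_chunk, best_score = index.search("How do I get a refund?", k=1)[0]
print(best_chunk)
```

In this toy data the refund chunk never uses the word "refund", so the match comes entirely from its summary.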

Level 3: Re-ranking with Claude

Even with good embeddings, the top-k results aren't always optimally ordered. Re-ranking uses Claude to evaluate and reorder retrieved chunks based on relevance to the specific question.

Implementing Re-ranking

def rerank_chunks(query: str, chunks: List[str], anthropic_client: Anthropic, top_k: int = 3) -> List[str]:
    """Use Claude to re-rank retrieved chunks by relevance."""
    # Prepare chunks for evaluation
    chunk_text = "\n\n---\n\n".join([
        f"Chunk {i+1}: {chunk}" for i, chunk in enumerate(chunks)
    ])
    
    prompt = f"""Given the question below, rank these chunks by relevance (most relevant first).
Question: {query}
{chunk_text}
Return the chunk numbers in order of relevance, like: 3, 1, 2"""
    
    response = anthropic_client.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=50,
        messages=[{"role": "user", "content": prompt}]
    )
    
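    # Parse the ranking defensively: keep only valid, unique chunk numbers
    ranking = []
    for token in response.content[0].text.replace(",", " ").split():
        token = token.strip(".:;")
        if token.isdigit():
            idx = int(token) - 1
            if 0 <= idx < len(chunks) and idx not in ranking:
                ranking.append(idx)
    # Fall back to the original retrieval order for anything Claude omitted
    ranking += [i for i in range(len(chunks)) if i not in ranking]
    ranked_chunks = [chunks[i] for i in ranking[:top_k]]
    
    return ranked_chunks


def advanced_rag(query: str, db: InMemoryVectorDB, anthropic_client: Anthropic, k: int = 5) -> str: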
    """Advanced RAG with re-ranking."""
    # Retrieve more chunks than needed
    results = db.search(query, k=k)
    chunks = [r["text"] for r in results]
    
    # Re-rank with Claude
    top_chunks = rerank_chunks(query, chunks, anthropic_client, top_k=3)
    context = "\n\n".join(top_chunks)
    
    # Generate final answer
    prompt = f"""Based on the following context, answer the question.
Context:
{context}
Question: {query}
Answer:"""
    
    response = anthropic_client.messages.create(
        model="claude-3-sonnet-20240229",
        max_tokens=500,
        messages=[{"role": "user", "content": prompt}]
    )
    
    return response.content[0].text

Results and Performance Gains

After implementing these optimizations, we achieved significant improvements:

Metric	Basic RAG	Advanced RAG	Relative Improvement
Avg Precision	0.43	0.44	+2%
Avg Recall	0.66	0.69	+5%
Avg F1 Score	0.52	0.54	+4%
Avg MRR	0.74	0.87	+18%
End-to-End Accuracy	71%	81%	+14% (+10 points)

The most dramatic improvement came in MRR (Mean Reciprocal Rank), showing that re-ranking effectively pushes the most relevant chunks to the top of Claude's context window.
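The improvement figures in the table are relative changes, (advanced - basic) / basic, not differences in percentage points. A short sketch of that arithmetic, using the table's own numbers (the helper name `relative_improvement` is ours):

```python
def relative_improvement(before: float, after: float) -> float:
    """Relative change from before to after, as a fraction of before."""
    if before == 0:
        raise ValueError("relative improvement is undefined when the baseline is 0")
    return (after - before) / before


# (basic, advanced) pairs copied from the results table
results = {
    "Avg Precision": (0.43, 0.44),
    "Avg Recall": (0.66, 0.69),
    "Avg F1 Score": (0.52, 0.54),
    "Avg MRR": (0.74, 0.87),
    "End-to-End Accuracy": (0.71, 0.81),
}

for metric, (basic, advanced) in results.items():
    print(f"{metric}: {relative_improvement(basic, advanced):+.0%}")
```

So the accuracy gain is 10 percentage points in absolute terms, which is about 14% relative to the 71% baseline.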

Production Considerations

Rate Limits: Full evaluations can hit API rate limits. Consider using Tier 2+ accounts or running evaluations incrementally.
Cost Management: Summary indexing and re-ranking add token costs. Balance improvement against budget.
Vector Database: For production, use managed solutions like Pinecone, Weaviate, or pgvector instead of in-memory stores.
Chunking Strategy: Experiment with different chunk sizes (256-1024 tokens) based on your document structure.
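For experimenting with chunk sizes, a fixed-size splitter with overlap is a simple alternative to heading-based chunking. The sketch below counts whitespace-separated words as a rough proxy for tokens, which is an assumption: a real tokenizer will give different counts. The function name `chunk_by_tokens` and the default values are illustrative.

```python
from typing import List


def chunk_by_tokens(text: str, max_tokens: int = 256, overlap: int = 32) -> List[str]:
    """Split text into windows of at most max_tokens words.

    Consecutive windows share `overlap` words so that sentences cut at a
    boundary still appear intact in one of the two neighbouring chunks.
    """
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must be in [0, max_tokens)")

    words = text.split()
    step = max_tokens - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_tokens]))
        # Stop once a window has reached the end of the text
        if start + max_tokens >= len(words):
            break
    return chunks


sample = " ".join(f"w{i}" for i in range(10))
chunks = chunk_by_tokens(sample, max_tokens=4, overlap=1)
print(chunks)
# prints: ['w0 w1 w2 w3', 'w3 w4 w5 w6', 'w6 w7 w8 w9']
```

Re-running the retrieval evaluation for a few `max_tokens` values gives a measured basis for choosing a chunk size.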

Key Takeaways

Evaluate systematically: Separate retrieval metrics (precision, recall, F1, MRR) from end-to-end accuracy to identify bottlenecks in your RAG pipeline.
Summary indexing improves context: Adding document summaries helps Claude understand the big picture before diving into details. In our evaluation, recall rose from 0.66 to 0.69 (about 5% relative).
Re-ranking with Claude boosts relevance: Using Claude to reorder retrieved chunks raised MRR from 0.74 to 0.87 (about 18% relative), so the most relevant information appears first.
Start simple, then optimize: Begin with basic RAG, establish your evaluation baseline, then incrementally add sophistication.
Monitor costs vs. benefits: Advanced techniques like summary generation and re-ranking add token costs. Measure whether the accuracy gains justify the expense for your use case.