Building Production-Ready RAG Systems with Claude: From Basic to Advanced
Learn to build, evaluate, and optimize Retrieval Augmented Generation (RAG) systems with Claude. Covers basic setup, evaluation metrics, summary indexing, and re-ranking techniques.
This guide walks you through building a RAG system with Claude, from a basic pipeline to advanced techniques like summary indexing and re-ranking. You'll learn to evaluate retrieval and end-to-end performance using precision, recall, F1, MRR, and accuracy metrics, with concrete code examples.
Retrieval Augmented Generation (RAG) is one of the most powerful patterns for extending Claude's capabilities with your own data. Whether you're building a customer support bot, an internal knowledge base Q&A system, or a financial analysis tool, RAG lets Claude answer questions grounded in your specific documents.
In this guide, we'll walk through building a RAG system using Claude and Voyage AI embeddings, using the Claude documentation as our knowledge base. We'll start with a basic implementation, then show you how to measure performance properly, and finally introduce advanced techniques that can significantly boost your results.
What You'll Learn
- How to set up a basic RAG pipeline with Claude
- How to build a robust evaluation suite for retrieval and end-to-end performance
- How to implement summary indexing for better context capture
- How to use re-ranking to improve answer quality
Prerequisites
You'll need:
- An Anthropic API key
- A Voyage AI API key
- Python 3.8+ with `anthropic`, `voyageai`, `pandas`, `numpy`, `matplotlib`, and `scikit-learn` installed
Level 1: Basic RAG Pipeline
Let's start with what's often called "Naive RAG" — a straightforward three-step process:
- Chunk your documents by headings
- Embed each chunk using Voyage AI
- Retrieve the most relevant chunks using cosine similarity
```python
import voyageai
from anthropic import Anthropic
import numpy as np

# Initialize clients
vo = voyageai.Client(api_key="your-voyage-api-key")
anthropic = Anthropic(api_key="your-anthropic-api-key")


class BasicRAG:
    def __init__(self, documents):
        self.documents = documents
        self.embeddings = self._embed_documents(documents)

    def _embed_documents(self, docs):
        result = vo.embed(docs, model="voyage-2")
        return np.array(result.embeddings)

    def retrieve(self, query, k=3):
        query_embedding = vo.embed([query], model="voyage-2").embeddings[0]
        # Voyage embeddings are normalized, so the dot product acts as cosine similarity
        similarities = np.dot(self.embeddings, query_embedding)
        top_indices = np.argsort(similarities)[-k:][::-1]
        return [self.documents[i] for i in top_indices]

    def answer(self, query):
        chunks = self.retrieve(query)
        context = "\n\n".join(chunks)
        response = anthropic.messages.create(
            model="claude-3-sonnet-20240229",
            max_tokens=1024,
            messages=[{
                "role": "user",
                "content": f"Answer the question based on this context:\n\n{context}\n\nQuestion: {query}"
            }]
        )
        return response.content[0].text
```
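A minimal usage sketch, assuming your documentation chunks already live in per-heading text files (the directory name and the sample question are placeholders):

```python
from pathlib import Path

# Load pre-chunked documentation files (hypothetical layout: one chunk per .txt file)
chunks = [p.read_text() for p in sorted(Path("docs_chunks").glob("*.txt"))]

rag = BasicRAG(chunks)
print(rag.answer("How do I set up streaming with Claude?"))
```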
This works, but how do you know if it's working well? That's where evaluation comes in.
Building an Evaluation System
Don't rely on "vibes" — measure your RAG system properly. The key insight is to evaluate retrieval and end-to-end performance separately.
Creating an Evaluation Dataset
Generate a synthetic evaluation set with 100+ samples. Each sample should include:
- A question
- The correct chunks (ground truth for retrieval)
- A correct answer (ground truth for end-to-end)
```python
# Example evaluation sample structure
eval_sample = {
    "question": "How do I set up streaming with Claude?",
    "relevant_chunks": ["chunk_42.txt", "chunk_43.txt"],
    "correct_answer": "To set up streaming, use the stream=True parameter..."
}
```
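One common way to bootstrap such a dataset is to have Claude write a question and answer for each chunk and record that chunk as the retrieval ground truth. A rough sketch of that approach, not the guide's exact method (the prompt and chunk-ID convention are assumptions):

```python
import json

def generate_eval_sample(chunk_id, chunk_text):
    """Ask Claude for a question/answer pair grounded in a single chunk."""
    response = anthropic.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=300,
        messages=[{
            "role": "user",
            "content": (
                "Write one question that can be answered from the text below, plus its answer. "
                "Respond with only JSON containing the keys 'question' and 'answer'.\n\n"
                f"{chunk_text}"
            )
        }]
    )
    # In practice, validate the output, since models occasionally return malformed JSON
    qa = json.loads(response.content[0].text)
    return {
        "question": qa["question"],
        "relevant_chunks": [chunk_id],
        "correct_answer": qa["answer"],
    }
```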
Key Metrics
#### Retrieval Metrics
Precision — Of the chunks retrieved, how many are relevant?
Precision = True Positives / Total Retrieved
High precision means fewer irrelevant chunks cluttering the context.
Recall — Of all relevant chunks, how many did we retrieve?
Recall = True Positives / Total Relevant
High recall ensures Claude has all the information it needs.
F1 Score — Harmonic mean of precision and recall.
F1 = (2 × Precision × Recall) / (Precision + Recall)
Mean Reciprocal Rank (MRR) — How high is the first relevant chunk in the results? Crucial for question-answering where one good chunk might be enough.
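To make these numbers concrete, here is a toy example with invented chunk IDs (not from the guide's evaluation set):

```python
retrieved = ["chunk_3", "chunk_1", "chunk_7"]       # what the retriever returned, in rank order
relevant = {"chunk_1", "chunk_2"}                   # ground-truth chunks for the question

tp = len(set(retrieved) & relevant)                 # 1 true positive
precision = tp / len(retrieved)                     # 1/3 ≈ 0.33
recall = tp / len(relevant)                         # 1/2 = 0.50
f1 = 2 * precision * recall / (precision + recall)  # ≈ 0.40
mrr = next((1.0 / rank for rank, c in enumerate(retrieved, 1) if c in relevant), 0.0)  # 0.50
```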
#### End-to-End Metric
Accuracy — Does Claude produce the correct answer? This is the ultimate test of your system.
Implementing the Evaluation
```python
def evaluate_retrieval(rag_system, eval_dataset):
    precisions, recalls, mrrs = [], [], []

    for sample in eval_dataset:
        # Retrieved items must use the same identifiers as the ground-truth chunks
        retrieved = rag_system.retrieve(sample["question"])
        relevant = sample["relevant_chunks"]

        # Calculate precision and recall
        true_positives = len(set(retrieved) & set(relevant))
        precision = true_positives / len(retrieved)
        recall = true_positives / len(relevant)

        # MRR: reciprocal rank of the first relevant result (0 if none was retrieved)
        mrr = 0.0
        for rank, chunk in enumerate(retrieved, 1):
            if chunk in relevant:
                mrr = 1.0 / rank
                break

        precisions.append(precision)
        recalls.append(recall)
        mrrs.append(mrr)

    return {
        "avg_precision": np.mean(precisions),
        "avg_recall": np.mean(recalls),
        "avg_f1": np.mean([2 * p * r / (p + r) if p + r > 0 else 0
                           for p, r in zip(precisions, recalls)]),
        "avg_mrr": np.mean(mrrs),
    }
```
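The function above only measures retrieval. For end-to-end accuracy you also need to judge whether Claude's generated answer matches the reference answer; exact string comparison is too brittle, so one common approach is to let a model grade the answers. A minimal sketch of that idea, using an assumed grading prompt rather than the guide's exact implementation:

```python
def evaluate_end_to_end(rag_system, eval_dataset):
    """Grade each generated answer against the reference answer with Claude."""
    correct = 0
    for sample in eval_dataset:
        generated = rag_system.answer(sample["question"])
        response = anthropic.messages.create(
            model="claude-3-haiku-20240307",
            max_tokens=10,
            messages=[{
                "role": "user",
                "content": (
                    f"Question: {sample['question']}\n"
                    f"Reference answer: {sample['correct_answer']}\n"
                    f"Candidate answer: {generated}\n\n"
                    "Does the candidate answer convey the same information as the "
                    "reference answer? Reply with exactly one word: CORRECT or INCORRECT."
                )
            }]
        )
        verdict = response.content[0].text.strip().upper()
        if verdict.startswith("CORRECT"):
            correct += 1
    return correct / len(eval_dataset)
```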
Level 2: Summary Indexing
Basic chunking by headings misses the bigger picture. Summary indexing creates an additional index where each chunk is paired with a summary of its broader context.
```python
def create_summary_index(documents, chunk_size=3):
    """Create summaries for groups of chunks."""
    summary_index = []

    for i in range(0, len(documents), chunk_size):
        group = documents[i:i + chunk_size]
        combined = "\n".join(group)

        response = anthropic.messages.create(
            model="claude-3-haiku-20240307",
            max_tokens=200,
            messages=[{
                "role": "user",
                "content": f"Summarize this content in 2-3 sentences:\n\n{combined}"
            }]
        )
        summary = response.content[0].text

        # Store both summary and original chunks
        summary_index.append({
            "summary": summary,
            "chunks": group,
            "embedding": vo.embed([summary], model="voyage-2").embeddings[0]
        })

    return summary_index
```
This improves recall by helping the retriever find relevant content even when the query doesn't match exact keywords in the chunk.
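The index becomes useful at query time: embed the question, score it against the summary embeddings, and return the original chunks from the best-matching groups. A minimal retrieval helper along those lines (this function is an illustration, not code from the original guide):

```python
def retrieve_via_summaries(query, summary_index, k=2):
    """Match the query against group summaries, then expand to the original chunks."""
    query_embedding = np.array(vo.embed([query], model="voyage-2").embeddings[0])
    scored = sorted(
        summary_index,
        key=lambda entry: float(np.dot(np.array(entry["embedding"]), query_embedding)),
        reverse=True,
    )
    chunks = []
    for entry in scored[:k]:
        chunks.extend(entry["chunks"])
    return chunks
```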
Level 3: Summary Indexing + Re-Ranking
The most advanced technique combines summary indexing with re-ranking. After retrieving candidates, use Claude to re-rank them by relevance to the query.
```python
import re

def rerank_with_claude(query, candidates, top_k=3):
    """Use Claude to re-rank retrieved chunks."""
    prompt = f"""Given the question: "{query}"
Rate each chunk from 1-10 for relevance (10 = most relevant).
Return only the chunk indices sorted by relevance, highest first.
Chunks:
"""
    for i, chunk in enumerate(candidates):
        prompt += f"\n[{i}]: {chunk[:200]}..."

    response = anthropic.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=200,
        messages=[{"role": "user", "content": prompt}]
    )

    # Parse the re-ranked indices out of the response and keep the top_k chunks
    # (In production, use structured output instead of free-text parsing)
    ranked_indices = list(dict.fromkeys(
        int(m) for m in re.findall(r"\d+", response.content[0].text)
        if int(m) < len(candidates)
    ))
    return [candidates[i] for i in ranked_indices[:top_k]]
```
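Putting the pieces together, the Level 3 flow is: retrieve a generous candidate set through the summary index, re-rank it, and pass only the top chunks to Claude. A sketch of how the pieces above might be wired up (the over-retrieval factor and prompt wording are assumptions):

```python
def answer_advanced(query, summary_index, top_k=3):
    # Over-retrieve via the summary index, then let the re-ranker narrow it down
    candidates = retrieve_via_summaries(query, summary_index, k=top_k * 2)
    best_chunks = rerank_with_claude(query, candidates, top_k=top_k)
    context = "\n\n".join(best_chunks)

    response = anthropic.messages.create(
        model="claude-3-sonnet-20240229",
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": f"Answer the question based on this context:\n\n{context}\n\nQuestion: {query}"
        }]
    )
    return response.content[0].text
```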
Results: What You Can Expect
With these improvements, we saw the following results on the Claude documentation evaluation set:
| Metric | Basic RAG | Advanced RAG |
|---|---|---|
| Avg Precision | 0.43 | 0.44 |
| Avg Recall | 0.66 | 0.69 |
| Avg F1 Score | 0.52 | 0.54 |
| Avg MRR | 0.74 | 0.87 |
| End-to-End Accuracy | 71% | 81% |
Production Considerations
- Vector Database: For production, replace the in-memory store with Pinecone, Weaviate, or pgvector (see the sketch after this list)
- Rate Limits: Full evaluations can hit rate limits — consider Tier 2+ or sample-based testing
- Cost: Summary indexing and re-ranking add token usage; optimize by caching summaries
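On the vector-database point, here is a rough idea of what swapping the in-memory store for pgvector could look like. The connection string, table schema, and 1024-dimension column (assumed to match voyage-2's embedding size) are illustrative assumptions:

```python
import numpy as np
import psycopg2
from pgvector.psycopg2 import register_vector

# Assumes a table like: CREATE TABLE chunks (id serial, content text, embedding vector(1024))
conn = psycopg2.connect("postgresql://localhost/rag_db")  # hypothetical connection string
register_vector(conn)  # lets psycopg2 pass numpy arrays as pgvector values

def retrieve_pgvector(query, k=3):
    """Nearest-neighbour search using pgvector's cosine-distance operator (<=>)."""
    query_embedding = np.array(vo.embed([query], model="voyage-2").embeddings[0])
    with conn.cursor() as cur:
        cur.execute(
            "SELECT content FROM chunks ORDER BY embedding <=> %s LIMIT %s",
            (query_embedding, k),
        )
        return [row[0] for row in cur.fetchall()]
```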
Key Takeaways
- Evaluate retrieval and end-to-end performance separately — This lets you pinpoint whether issues are in finding information or in answering with it.
- Summary indexing improves recall by capturing the broader context around individual chunks, helping Claude find relevant information even with imperfect queries.
- Re-ranking with Claude significantly boosts MRR — Getting the most relevant chunk to the top of the context window improves answer quality dramatically.
- Start simple, measure, then optimize — A basic RAG pipeline can be surprisingly effective. Use data to decide where to invest in improvements.
- Your evaluation dataset is your most important asset — Invest time in creating high-quality, representative test samples that reflect real user queries.