Guide · Beginner · Best Practices · 2026-05-16

Building Production-Ready RAG Systems with Claude: From Basic to Advanced

Learn to build, evaluate, and optimize Retrieval Augmented Generation (RAG) systems with Claude. Covers basic setup, evaluation metrics, summary indexing, and re-ranking techniques for production-grade performance.

Quick Answer

This guide teaches you to build production-ready RAG systems with Claude, covering basic setup with Voyage AI embeddings, comprehensive evaluation using precision/recall/MRR metrics, and advanced optimization techniques like summary indexing and re-ranking to boost end-to-end accuracy from 71% to 81%.

RAG · Retrieval Augmented Generation · Claude API · Vector Search · LLM Evaluation

Retrieval Augmented Generation (RAG) is one of the most powerful patterns for extending Claude's capabilities to your proprietary data. While Claude excels at general knowledge tasks, it can't know your internal documentation, customer support history, or proprietary research. RAG bridges this gap by dynamically retrieving relevant information from your knowledge base and injecting it into Claude's context window.

In this guide, we'll walk through building a production-grade RAG system using the Anthropic Cookbook's reference implementation. We'll start with a basic "naive" RAG pipeline, then systematically improve it using advanced techniques like summary indexing and re-ranking. Along the way, we'll build a proper evaluation framework—because without measurement, you're just guessing.

Understanding the RAG Architecture

Before diving into code, let's understand the three core components of any RAG system:

Ingestion Pipeline: Chunks documents, generates embeddings, and stores them in a vector database
Retrieval System: Takes a user query, embeds it, and finds the most semantically similar document chunks
Generation System: Passes retrieved chunks to Claude along with the original query for answer generation

The key insight? Each component can be independently optimized and evaluated.

Level 1: Building a Basic RAG Pipeline

Let's start with what the industry calls "Naive RAG"—a straightforward implementation that gets the job done but has plenty of room for improvement.

Setup and Dependencies

First, install the required libraries:

pip install anthropic voyageai pandas numpy scikit-learn matplotlib

You'll need API keys from both Anthropic and Voyage AI. Voyage AI provides specialized embedding models that outperform general-purpose alternatives for retrieval tasks.

Initializing the Vector Database

For this example, we'll use an in-memory vector store. In production, you'd likely use Pinecone, Weaviate, or another hosted solution.

import voyageai
import numpy as np
from typing import List, Dict, Tuple


class InMemoryVectorDB:
    def __init__(self, api_key: str):
        self.client = voyageai.Client(api_key=api_key)
        self.documents = []   # original document dicts
        self.embeddings = []  # one embedding per document, in the same order

    def add_documents(self, documents: List[Dict[str, str]]):
        """Embed each document's content and store it alongside the document."""
        texts = [doc["content"] for doc in documents]
        response = self.client.embed(texts, model="voyage-2")
        self.embeddings.extend(response.embeddings)
        self.documents.extend(documents)

    def search(self, query: str, k: int = 3) -> List[Tuple[Dict[str, str], float]]:
        """Retrieve the top-k most similar documents, best match first."""
        query_embedding = self.client.embed([query], model="voyage-2").embeddings[0]
        # The dot product equals cosine similarity only for unit-length vectors;
        # normalize first if your embedding model does not return normalized vectors.
        similarities = [
            np.dot(query_embedding, doc_emb)
            for doc_emb in self.embeddings
        ]
        # argsort is ascending, so take the last k indices and reverse them
        top_indices = np.argsort(similarities)[-k:][::-1]
        return [(self.documents[i], similarities[i]) for i in top_indices]
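
The ranking step above can also be written with an explicit cosine similarity, which does not depend on the embeddings being normalized. The following is a minimal sketch in pure Python; the helper names `cosine_similarity` and `top_k` are hypothetical and not part of the reference implementation.

```python
import math
from typing import List


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen for this sketch: zero vectors match nothing
    return dot / (norm_a * norm_b)


def top_k(query: List[float], vectors: List[List[float]], k: int = 3) -> List[int]:
    """Indices of the k vectors most similar to the query, best first."""
    scores = [cosine_similarity(query, v) for v in vectors]
    ranked = sorted(range(len(vectors)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]
```

For unit-length vectors both norms are 1, so this reduces to the plain dot product used in `search`.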

The Basic RAG Pipeline

Our naive approach follows three steps:

Chunk documents by heading—each section becomes a separate chunk
Embed each chunk using Voyage AI's embedding model
Retrieve top-k chunks using cosine similarity and pass them to Claude
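
The first step, chunking by heading, might look like the sketch below. It assumes the source documents are Markdown with `#`-style headings; the function name `chunk_by_heading`, the `chunk-N` id scheme, and the `"preamble"` label are hypothetical choices for illustration.

```python
import re
from typing import Dict, List


def chunk_by_heading(markdown_text: str) -> List[Dict[str, str]]:
    """Split a Markdown document into one chunk per heading section."""
    chunks: List[Dict[str, str]] = []
    heading = "preamble"  # hypothetical label for text before the first heading
    lines: List[str] = []

    def flush() -> None:
        # Store the section collected so far, skipping empty sections
        content = "\n".join(lines).strip()
        if content:
            chunks.append({
                "id": f"chunk-{len(chunks)}",
                "heading": heading,
                "content": content,
            })

    for line in markdown_text.splitlines():
        match = re.match(r"^#{1,6}\s+(.*)$", line)
        if match:
            flush()  # close the previous section
            heading = match.group(1).strip()
            lines = []
        else:
            lines.append(line)
    flush()  # close the final section
    return chunks
```

The resulting dictionaries carry the `id` and `content` keys that the vector store and the evaluation code in this guide rely on.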

from anthropic import Anthropic


class BasicRAG:
    def __init__(self, anthropic_key: str, voyage_key: str):
        self.vector_db = InMemoryVectorDB(voyage_key)
        self.llm = Anthropic(api_key=anthropic_key)
    
    def query(self, question: str) -> str:
        # Step 1: Retrieve relevant chunks
        retrieved = self.vector_db.search(question, k=3)
        context = "\n\n---\n\n".join([doc["content"] for doc, _ in retrieved])
        
        # Step 2: Generate answer with Claude
        response = self.llm.messages.create(
            model="claude-3-5-sonnet-20241022",
            max_tokens=1024,
            messages=[{
                "role": "user",
                "content": f"Context:\n{context}\n\nQuestion: {question}"
            }]
        )
        return response.content[0].text

This works, but it has limitations. Chunks are arbitrary, retrieval quality is inconsistent, and we have no way to measure performance.

Building a Comprehensive Evaluation System

"Vibes-based" evaluation won't cut it for production systems. We need quantitative metrics that measure both retrieval quality and end-to-end answer accuracy.

Creating a Synthetic Evaluation Dataset

The Anthropic Cookbook provides a dataset of 100 samples, each containing:

A question
Ground-truth relevant chunks
A correct answer

This dataset is intentionally challenging—some questions require synthesizing information across multiple chunks.

import json

with open("evaluation/docs_evaluation_dataset.json", "r") as f:
    eval_data = json.load(f)

# Preview the first sample
print(json.dumps(eval_data[0], indent=2))

Retrieval Metrics

We evaluate retrieval quality using four standard metrics:

Precision: Of the chunks we retrieved, how many were actually relevant?

Precision = |Retrieved ∩ Correct| / |Retrieved|

Recall: Of all the correct chunks that exist, how many did we retrieve?

Recall = |Retrieved ∩ Correct| / |Correct|

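F1 Score: The harmonic mean of precision and recall, which is high only when both are high.

F1 = 2 × Precision × Recall / (Precision + Recall)

Mean Reciprocal Rank (MRR): How early in the results does the first relevant chunk appear? For each query, take 1/rank of the first relevant chunk (0 if none is retrieved), then average over all queries. This matters because only the top-ranked chunks are passed to Claude, so a relevant chunk that ranks below the cutoff never reaches the context.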
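
The four retrieval metrics can be checked by hand on a single query. The sketch below uses made-up chunk ids (`c2`, `c4`, `c7`, `c9`), and the function name `retrieval_metrics` is hypothetical.

```python
from typing import Dict, List, Set


def retrieval_metrics(retrieved: List[str], correct: Set[str]) -> Dict[str, float]:
    """Precision, recall, F1 and reciprocal rank for one query."""
    hits = len(set(retrieved) & correct)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(correct) if correct else 0.0
    if precision + recall > 0:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    # Reciprocal rank: 1 / position of the first relevant chunk, 0 if none
    reciprocal_rank = 0.0
    for rank, chunk_id in enumerate(retrieved, start=1):
        if chunk_id in correct:
            reciprocal_rank = 1.0 / rank
            break
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "reciprocal_rank": reciprocal_rank,
    }


# Three chunks retrieved in ranked order; two chunks are actually relevant.
example = retrieval_metrics(["c7", "c2", "c9"], {"c2", "c4"})
# precision = 1/3, recall = 1/2, f1 = 0.4, reciprocal_rank = 1/2
```

Averaging `reciprocal_rank` over all queries in the dataset gives MRR.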

End-to-End Accuracy

This measures whether Claude's final answer is correct given the retrieved context. It's the ultimate test—even perfect retrieval is useless if Claude can't synthesize the information correctly.

def evaluate_retrieval(rag_system, eval_data):
    """Evaluate retrieval metrics across the dataset"""
    precisions, recalls, f1s, mrrs = [], [], [], []
    
    for sample in eval_data:
        retrieved = rag_system.vector_db.search(sample["question"], k=3)
        retrieved_ids = {doc["id"] for doc, _ in retrieved}
        correct_ids = set(sample["relevant_chunk_ids"])
        
        # Calculate metrics
        true_positives = len(retrieved_ids & correct_ids)
        precision = true_positives / len(retrieved_ids)
        recall = true_positives / len(correct_ids)
        f1 = 2 * (precision * recall) / (precision + recall) if (precision + recall) > 0 else 0
        
        # MRR: reciprocal rank of first relevant result
        for rank, (doc, _) in enumerate(retrieved, 1):
            if doc["id"] in correct_ids:
                mrr = 1.0 / rank
                break
        else:
            mrr = 0.0
        
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
        mrrs.append(mrr)
    
    return {
        "avg_precision": np.mean(precisions),
        "avg_recall": np.mean(recalls),
        "avg_f1": np.mean(f1s),
        "avg_mrr": np.mean(mrrs)
    }
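
End-to-end accuracy can be measured with a similar loop. The sketch below is one possible shape, not the Cookbook's implementation: `answer_fn` stands in for a call such as `BasicRAG.query`, `judge_fn` stands in for whatever grader you use (for example, a Claude call that compares the generated answer to the reference), and the `correct_answer` field name is an assumption about the dataset schema.

```python
from typing import Callable, Dict, List


def evaluate_end_to_end(
    answer_fn: Callable[[str], str],
    judge_fn: Callable[[str, str, str], bool],
    eval_data: List[Dict],
) -> float:
    """Fraction of samples whose generated answer the judge accepts."""
    if not eval_data:
        return 0.0
    correct = 0
    for sample in eval_data:
        generated = answer_fn(sample["question"])
        # "correct_answer" is an assumed field name; adjust to your dataset
        if judge_fn(sample["question"], sample["correct_answer"], generated):
            correct += 1
    return correct / len(eval_data)
```

Passing the answerer and the judge in as functions keeps the loop testable with stubs before you spend API calls on a full run.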

Level 2: Summary Indexing

Our basic RAG has a fundamental problem: chunks are too granular. A single heading might contain multiple distinct concepts, and relevant information might span across headings.

Summary indexing solves this by creating higher-level summaries of document sections. Instead of retrieving raw chunks, we retrieve summaries that provide broader context.

def create_summary_index(documents, llm_client):
    """Create summary embeddings for document sections"""
    summaries = []
    for doc in documents:
        response = llm_client.messages.create(
            model="claude-3-haiku-20240307",
            max_tokens=200,
            messages=[{
                "role": "user",
                "content": f"Summarize this section in 2-3 sentences:\n\n{doc['content']}"
            }]
        )
        summaries.append({
            "id": doc["id"],
            "summary": response.content[0].text,
            "original_content": doc["content"]
        })
    return summaries

During retrieval, we embed the query and search against the summary embeddings. Once we find relevant summaries, we pass the corresponding full chunks to Claude as context. Because a summary captures the essence of a longer passage, a query can match a section even when it shares little wording with the raw text, which is intended to improve recall.
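
The retrieval step over summaries could be sketched as follows. It reuses the dictionary shape returned by `create_summary_index` (`id`, `summary`, `original_content`); the function name `search_summaries` and the injected `embed_fn` are hypothetical, with `embed_fn` standing in for a real embedding call.

```python
from typing import Callable, Dict, List


def search_summaries(
    query: str,
    summaries: List[Dict[str, str]],
    summary_embeddings: List[List[float]],
    embed_fn: Callable[[str], List[float]],
    k: int = 3,
) -> List[Dict[str, str]]:
    """Rank by similarity to the summaries, but return the full chunks."""
    query_vec = embed_fn(query)
    # Dot product against each summary embedding (assumes normalized vectors)
    scores = [
        sum(q * s for q, s in zip(query_vec, emb))
        for emb in summary_embeddings
    ]
    ranked = sorted(range(len(summaries)), key=lambda i: scores[i], reverse=True)
    # Swap each matched summary for the original content it was built from
    return [
        {"id": summaries[i]["id"], "content": summaries[i]["original_content"]}
        for i in ranked[:k]
    ]
```

The returned dictionaries use the same `id` and `content` keys as ordinary chunks, so the generation and evaluation code does not need to change.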

Level 3: Adding Re-Ranking

Even with summary indexing, our initial retrieval might miss the mark. Re-ranking adds a second stage: after retrieving top-k candidates, we use Claude to score and reorder them based on relevance to the query.

def rerank_with_claude(query: str, candidates: List[Dict], llm_client) -> List[Dict]:
    """Use Claude to re-rank retrieved chunks by relevance"""
    scored = []
    for chunk in candidates:
        response = llm_client.messages.create(
            model="claude-3-haiku-20240307",
            max_tokens=10,
            messages=[{
                "role": "user",
                "content": f"On a scale of 0-10, how relevant is this chunk to the query?\n\nQuery: {query}\n\nChunk: {chunk['content'][:500]}\n\nRelevance score (just the number):"
            }]
        )
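        # Parse defensively: the reply may not be a bare number
        try:
            score = float(response.content[0].text.strip().split()[0])
        except (ValueError, IndexError):
            score = 0.0  # treat unparseable replies as not relevant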
        scored.append((chunk, score))
    
    # Sort by score descending
    scored.sort(key=lambda x: x[1], reverse=True)
    return [chunk for chunk, _ in scored]

In the results below, the effect of re-ranking shows up mainly in MRR: the first relevant chunk appears earlier in the results. Ordering matters because only the top few chunks are passed to Claude, so a relevant chunk has to rank high to reach the context at all.
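
Re-ranking is typically paired with a wider first-stage retrieval, so that the second stage has more candidates to choose from. The sketch below shows that two-stage flow with injected functions: `search_fn` stands in for the vector search and `score_fn` for a scorer such as the Claude call above. The function name and the default values of `initial_k` and `final_k` are illustrative assumptions, not tuned settings.

```python
from typing import Callable, Dict, List


def retrieve_then_rerank(
    query: str,
    search_fn: Callable[[str, int], List[Dict]],
    score_fn: Callable[[str, Dict], float],
    initial_k: int = 20,
    final_k: int = 3,
) -> List[Dict]:
    """Over-retrieve candidates, score each one, and keep the best final_k."""
    candidates = search_fn(query, initial_k)
    scored = [
        (score_fn(query, chunk), position, chunk)
        for position, chunk in enumerate(candidates)
    ]
    # Highest score first; ties keep the first-stage order
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [chunk for _, _, chunk in scored[:final_k]]
```

Each candidate costs one scoring call, so `initial_k` directly trades latency and API usage against the chance of surfacing a relevant chunk.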

Results: The Impact of Optimization

After implementing summary indexing and re-ranking, the improvements are significant:

Metric	Basic RAG	Optimized RAG
Avg Precision	0.43	0.44
Avg Recall	0.66	0.69
Avg F1 Score	0.52	0.54
Avg MRR	0.74	0.87
End-to-End Accuracy	71%	81%

The most dramatic improvement is in MRR (0.74 → 0.87), showing that relevant chunks now appear much earlier in results. This translates to a 10 percentage point improvement in end-to-end accuracy.

Production Considerations

When moving to production, consider these additional factors:

Rate Limits: Full evaluations can hit API rate limits. Use Tier 2+ accounts and consider sampling your evaluation dataset.
Chunking Strategy: Experiment with different chunk sizes and overlap. The optimal size depends on your document structure.
Embedding Model: Voyage AI's voyage-2 is excellent, but test alternatives like text-embedding-3-small from OpenAI.
Caching: Cache embeddings for frequently accessed documents to reduce API calls and latency.
Monitoring: Log retrieval metrics in production to detect degradation over time.
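
The caching point can be illustrated with a small in-memory wrapper keyed by a hash of the text. This is a sketch under assumptions: the class name `EmbeddingCache` is hypothetical, and `embed_batch_fn` stands in for a real batch embedding call. A production cache would likely persist to disk or a key-value store.

```python
import hashlib
from typing import Callable, Dict, List


class EmbeddingCache:
    """Embed each distinct text once and reuse the stored vector afterwards."""

    def __init__(self, embed_batch_fn: Callable[[List[str]], List[List[float]]]):
        self._embed_batch_fn = embed_batch_fn
        self._cache: Dict[str, List[float]] = {}

    @staticmethod
    def _key(text: str) -> str:
        return hashlib.sha256(text.encode("utf-8")).hexdigest()

    def embed(self, texts: List[str]) -> List[List[float]]:
        # Collect texts not yet cached, without duplicates, preserving order
        missing: List[str] = []
        seen = set()
        for text in texts:
            key = self._key(text)
            if key not in self._cache and key not in seen:
                seen.add(key)
                missing.append(text)
        if missing:
            # One batched call for everything that is new
            for text, vector in zip(missing, self._embed_batch_fn(missing)):
                self._cache[self._key(text)] = vector
        return [self._cache[self._key(text)] for text in texts]
```

Keying on a content hash means an edited document is re-embedded automatically, while unchanged documents never trigger another API call.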

Key Takeaways

Measure what matters: Build separate evaluation pipelines for retrieval quality (precision, recall, F1, MRR) and end-to-end accuracy. Without metrics, you can't optimize.
Summary indexing beats raw chunking: Searching over summary-level embeddings aims to improve recall by capturing the essence of longer passages. In the combined results above, recall rose modestly, from 0.66 to 0.69.
Re-ranking is worth the latency: A second-stage re-ranking pass with Claude moves the most relevant context toward the top. MRR rose from 0.74 to 0.87 in the optimized pipeline.
Optimize iteratively: Start with basic RAG, measure baseline performance, then systematically apply improvements. The 10-point accuracy gain from 71% to 81% came from targeted, measurable changes.
Synthetic evaluation datasets are powerful: Generate challenging test cases that require multi-chunk synthesis to truly stress-test your system.