Building Production-Ready RAG Systems with Claude: From Basic to Advanced
This guide walks through building a RAG system with Claude, from a basic pipeline to advanced techniques like summary indexing and re-ranking. You'll learn how to evaluate retrieval and end-to-end performance using precision, recall, F1, MRR, and accuracy metrics.
Claude excels at general knowledge tasks, but when you need answers grounded in your own documents—internal wikis, product manuals, or customer support logs—standard prompting falls short. That's where Retrieval Augmented Generation (RAG) comes in.
RAG lets Claude tap into your private knowledge bases, dramatically improving its ability to answer domain-specific questions. Enterprises are using RAG to power customer support bots, internal Q&A systems, financial analysis tools, and more.
In this guide, we'll build a RAG system using Claude and the Claude Documentation as our knowledge base. We'll start with a basic pipeline, then level up with advanced techniques that measurably improve performance.
What You'll Learn
- How to set up a basic RAG pipeline with Claude, Voyage AI embeddings, and an in-memory vector store
- How to build a robust evaluation suite that measures retrieval and end-to-end performance independently
- How to implement summary indexing and re-ranking to boost accuracy from 71% to 81%
Prerequisites
- Python 3.8+ with pip
- An Anthropic API key
- A Voyage AI API key (for embeddings)
Step 1: Basic RAG Setup
Let's start with what the industry calls "Naive RAG." It's simple but effective for many use cases.
Install Dependencies
```shell
pip install anthropic voyageai pandas numpy matplotlib scikit-learn
```
Initialize the Vector Database
We'll use an in-memory vector DB for this example. For production, consider hosted solutions like Pinecone, Weaviate, or Chroma.
```python
import numpy as np
import voyageai
from typing import Dict, List

class InMemoryVectorDB:
    def __init__(self, api_key: str):
        self.client = voyageai.Client(api_key=api_key)
        self.documents = []
        self.embeddings = []

    def add_documents(self, documents: List[Dict[str, str]]):
        """Add documents with their embeddings."""
        texts = [doc["content"] for doc in documents]
        embeddings = self.client.embed(texts, model="voyage-2").embeddings
        self.documents.extend(documents)
        self.embeddings.extend(embeddings)

    def search(self, query: str, k: int = 3) -> List[Dict]:
        """Retrieve top-k documents by cosine similarity."""
        query_embedding = self.client.embed([query], model="voyage-2").embeddings[0]
        similarities = [
            np.dot(query_embedding, doc_emb)
            / (np.linalg.norm(query_embedding) * np.linalg.norm(doc_emb))
            for doc_emb in self.embeddings
        ]
        top_indices = np.argsort(similarities)[-k:][::-1]
        return [self.documents[i] for i in top_indices]
Build the Basic RAG Pipeline
```python
from anthropic import Anthropic

class BasicRAG:
    def __init__(self, anthropic_key: str, voyage_key: str):
        self.anthropic = Anthropic(api_key=anthropic_key)
        self.vector_db = InMemoryVectorDB(api_key=voyage_key)

    def answer(self, query: str) -> str:
        # 1. Retrieve relevant chunks
        chunks = self.vector_db.search(query, k=3)
        context = "\n\n".join([chunk["content"] for chunk in chunks])

        # 2. Generate answer with Claude
        response = self.anthropic.messages.create(
            model="claude-3-5-sonnet-20241022",
            max_tokens=1024,
            messages=[{
                "role": "user",
                "content": f"Context:\n{context}\n\nQuestion: {query}\n\nAnswer based on the context above."
            }]
        )
        return response.content[0].text
```
Step 2: Building an Evaluation System
"Vibes-based" evaluation won't cut it for production. You need metrics that tell you exactly where your system excels and where it falls short.
Create a Synthetic Evaluation Dataset
Generate 100 test samples, each containing:
- A question
- The correct answer
- The relevant document chunks (ground truth for retrieval)
```python
import json

# Example structure of one evaluation sample
eval_sample = {
    "question": "How do I set up streaming with Claude?",
    "answer": "To set up streaming, use the stream=True parameter...",
    "relevant_chunks": [
        "Streaming allows you to receive partial responses...",
        "Set stream=True in the Messages API call..."
    ]
}

# Save the full list of samples to a file
eval_samples = [eval_sample]  # in practice, the 100 generated samples
with open("evaluation_dataset.json", "w") as f:
    json.dump(eval_samples, f, indent=2)
```
Define Key Metrics
We evaluate two dimensions separately: retrieval quality and end-to-end accuracy.
#### Retrieval Metrics
Precision – Of the chunks retrieved, how many are relevant?

$$\text{Precision} = \frac{|\text{Retrieved} \cap \text{Correct}|}{|\text{Retrieved}|}$$

Recall – Of all relevant chunks, how many did we retrieve?

$$\text{Recall} = \frac{|\text{Retrieved} \cap \text{Correct}|}{|\text{Correct}|}$$

F1 Score – Harmonic mean of precision and recall.

Mean Reciprocal Rank (MRR) – How high does the first relevant chunk appear in the results? Critical for question answering, where one good chunk may be enough.

#### End-to-End Metric

Accuracy – Does Claude's final answer match the ground truth? Use another LLM call to judge correctness.

```python
def evaluate_retrieval(retrieved_chunks, relevant_chunks):
    retrieved_set = set(retrieved_chunks)
    relevant_set = set(relevant_chunks)
    true_positives = len(retrieved_set & relevant_set)
    precision = true_positives / len(retrieved_set) if retrieved_set else 0
    recall = true_positives / len(relevant_set) if relevant_set else 0
    f1 = 2 * (precision * recall) / (precision + recall) if (precision + recall) > 0 else 0
    return precision, recall, f1
```
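To make the formulas concrete, here is a small worked example with made-up chunk IDs, including a reciprocal-rank computation (MRR averages the reciprocal rank over all evaluation queries):

```python
# Hypothetical retrieval result: chunk IDs, ranked best-first.
retrieved = ["a", "b", "c"]   # what the system returned
relevant = {"b", "c", "d"}    # ground-truth relevant chunks

tp = len(set(retrieved) & relevant)                  # 2 ("b" and "c")
precision = tp / len(retrieved)                      # 2/3
recall = tp / len(relevant)                          # 2/3
f1 = 2 * precision * recall / (precision + recall)   # 2/3

# Reciprocal rank: 1 / (1-indexed position of the first relevant chunk).
# "a" is not relevant, "b" at rank 2 is, so RR = 1/2.
rr = next(1 / (i + 1) for i, c in enumerate(retrieved) if c in relevant)
print(round(precision, 3), round(recall, 3), round(rr, 3))  # → 0.667 0.667 0.5
```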
```python
def evaluate_end_to_end(question, predicted_answer, correct_answer):
    # Use Claude to judge correctness (client is an Anthropic() instance)
    prompt = f"""Question: {question}
Predicted Answer: {predicted_answer}
Correct Answer: {correct_answer}

Is the predicted answer correct? Answer only 'yes' or 'no'."""
    response = client.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=10,
        messages=[{"role": "user", "content": prompt}]
    )
    return response.content[0].text.strip().lower() == "yes"
```
Step 3: Advanced Techniques
Level 2: Summary Indexing
Basic RAG chunks by heading, which can miss cross-chunk relationships. Summary indexing creates a separate index of chunk summaries, making retrieval more robust.
```python
def create_summary_index(documents):
    summaries = []
    for doc in documents:
        response = client.messages.create(
            model="claude-3-haiku-20240307",
            max_tokens=200,
            messages=[{
                "role": "user",
                "content": f"Summarize this document in 1-2 sentences:\n\n{doc['content']}"
            }]
        )
        summaries.append({
            "original": doc,
            "summary": response.content[0].text
        })
    return summaries
```
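At query time, you search the summaries but hand Claude the original chunks. The sketch below shows that pattern with a stand-in scoring function (word overlap) in place of real embeddings; `word_overlap` and the toy data are illustrative assumptions, not part of the pipeline above:

```python
def search_summary_index(summaries, query, score_fn, k=1):
    # Rank entries by how well their SUMMARY matches the query,
    # but return the ORIGINAL full chunks for generation.
    ranked = sorted(summaries, key=lambda s: score_fn(query, s["summary"]), reverse=True)
    return [s["original"] for s in ranked[:k]]

# Stand-in scorer: shared-word count (a real system would compare embeddings).
def word_overlap(a, b):
    return len(set(a.lower().split()) & set(b.lower().split()))

index = [
    {"original": {"content": "Full text about streaming responses..."},
     "summary": "Explains streaming partial responses from the API"},
    {"original": {"content": "Full text about rate limits..."},
     "summary": "Describes rate limits and usage tiers"},
]
hits = search_summary_index(index, "how do rate limits work", word_overlap)
print(hits[0]["content"])  # → Full text about rate limits...
```

The key design point is the indirection: the concise summary is what gets matched, while the full chunk is what reaches Claude's context window.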
Level 3: Summary Indexing + Re-Ranking
Re-ranking uses Claude to reorder retrieved chunks by relevance to the query. This dramatically improves MRR.
```python
def rerank_with_claude(query, vector_db, top_k=3):
    # First pass: retrieve more chunks than needed
    initial_chunks = vector_db.search(query, k=10)

    # Second pass: ask Claude to rank them
    chunk_texts = [f"Chunk {i}: {c['content']}" for i, c in enumerate(initial_chunks)]
    joined_chunks = "\n".join(chunk_texts)
    prompt = f"""Query: {query}

Chunks:
{joined_chunks}

Rank the chunks by relevance to the query. Return the indices of the top {top_k} chunks, most relevant first, as a comma-separated list."""
    response = client.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=100,
        messages=[{"role": "user", "content": prompt}]
    )
    indices = [int(i.strip()) for i in response.content[0].text.split(",")[:top_k]]
    return [initial_chunks[i] for i in indices]
```
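Model output is not guaranteed to be a clean comma-separated list, so production code should parse it defensively. One possible sketch (pulling integers out with a regex is an assumption here, not the only approach):

```python
import re

def parse_rank_indices(text, n_chunks, top_k):
    # Pull every integer out of the model's reply, keep only valid,
    # previously unseen chunk indices, and stop once we have top_k.
    indices = []
    for match in re.findall(r"\d+", text):
        i = int(match)
        if 0 <= i < n_chunks and i not in indices:
            indices.append(i)
        if len(indices) == top_k:
            break
    return indices

print(parse_rank_indices("Top chunks: 7, 2, and 7, then 99", n_chunks=10, top_k=3))  # → [7, 2]
```

Duplicates and out-of-range indices are dropped silently; returning fewer than `top_k` indices is preferable to raising an `IndexError` mid-pipeline.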
Results: Measurable Improvements
After implementing summary indexing and re-ranking, we saw significant gains:
| Metric | Basic RAG | Advanced RAG |
|---|---|---|
| Avg Precision | 0.43 | 0.44 |
| Avg Recall | 0.66 | 0.69 |
| Avg F1 Score | 0.52 | 0.54 |
| Avg MRR | 0.74 | 0.87 |
| End-to-End Accuracy | 71% | 81% |
Production Considerations
- Rate Limits: Full evaluations may hit rate limits below Tier 2. Consider sampling your dataset or running evaluations incrementally.
- Vector Database: For production, replace the in-memory DB with a hosted solution (Pinecone, Weaviate, Qdrant).
- Chunking Strategy: Experiment with different chunk sizes (256-1024 tokens) and overlap (10-20%).
- Embedding Model: Voyage AI embeddings are optimized for retrieval, but you can also use OpenAI's `text-embedding-3-small` or open-source models like `BAAI/bge-base-en-v1.5`.
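The chunking advice above can be sketched as a simple sliding-window splitter. This version counts whitespace-separated words as a rough token proxy, and the sizes are illustrative defaults, not tuned values:

```python
def chunk_text(text, chunk_size=256, overlap_ratio=0.15):
    # Split on whitespace as a cheap token approximation, then slide a
    # window of chunk_size words with ~15% overlap between windows.
    words = text.split()
    step = max(1, int(chunk_size * (1 - overlap_ratio)))
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break
    return chunks

doc = " ".join(f"w{i}" for i in range(600))
chunks = chunk_text(doc, chunk_size=256)
print(len(chunks))  # → 3
```

In practice you would chunk on token counts from your embedding model's tokenizer and prefer splitting at heading or paragraph boundaries, but the window-plus-overlap structure is the same.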
Key Takeaways
- Evaluate retrieval and generation separately. Use precision, recall, F1, and MRR for retrieval; use LLM-as-judge for end-to-end accuracy.
- Summary indexing improves recall by capturing document-level semantics that chunk-level indexing misses.
- Re-ranking with Claude boosts MRR significantly (from 0.74 to 0.87) by applying semantic understanding to initial retrieval results.
- Start simple, then iterate. A basic RAG pipeline can be surprisingly effective. Add complexity only when metrics show clear room for improvement.
- Synthetic evaluation datasets are powerful. Generate 50-100 samples covering your domain to get reliable performance signals without manual labeling.