Building Production-Ready RAG Systems with Claude: From Basic to Advanced
Learn to build, evaluate, and optimize Retrieval Augmented Generation (RAG) systems with Claude. Covers basic setup, evaluation metrics, summary indexing, and re-ranking techniques.
This guide shows you how to build a RAG system with Claude, from a basic pipeline to advanced techniques like summary indexing and re-ranking. You'll learn to evaluate retrieval and end-to-end performance using precision, recall, F1, MRR, and accuracy metrics.
Claude excels at a wide range of tasks, but it may struggle with queries specific to your unique business context. This is where Retrieval Augmented Generation (RAG) becomes invaluable. RAG enables Claude to leverage your internal knowledge bases or customer support documents, significantly enhancing its ability to answer domain-specific questions.
Enterprises are increasingly building RAG applications to improve workflows in customer support, Q&A over internal company documents, financial & legal analysis, and much more. In this guide, we'll demonstrate how to build and optimize a RAG system using the Claude Documentation as our knowledge base.
What You'll Learn
By the end of this guide, you will be able to:
- Set up a basic RAG system using an in-memory vector database and embeddings from Voyage AI
- Build a robust evaluation suite that measures retrieval and end-to-end performance independently
- Implement advanced techniques including summary indexing and re-ranking with Claude
Across the three pipeline levels in this guide, our optimizations improved every evaluation metric, from the basic pipeline to the final one:
- Avg Precision: 0.43 → 0.44
- Avg Recall: 0.66 → 0.69
- Avg F1 Score: 0.52 → 0.54
- Avg Mean Reciprocal Rank (MRR): 0.74 → 0.87
- End-to-End Accuracy: 71% → 81%
Prerequisites and Setup
Before diving in, you'll need:
- API keys from Anthropic and Voyage AI
- Python 3.8+ environment
- Required libraries:
`anthropic`, `voyageai`, `pandas`, `numpy`, `matplotlib`, `scikit-learn`
Initialize a Vector DB Class
In this example, we're using an in-memory vector DB. For production, consider a hosted solution like Pinecone, Weaviate, or Chroma.
```python
import voyageai
import numpy as np
from typing import List, Dict, Any

class InMemoryVectorDB:
    def __init__(self, api_key: str):
        self.client = voyageai.Client(api_key=api_key)
        self.documents = []
        self.embeddings = []

    def add_documents(self, documents: List[Dict[str, str]]):
        """Add documents with their embeddings."""
        texts = [doc["content"] for doc in documents]
        embeddings = self.client.embed(texts, model="voyage-2").embeddings
        self.documents.extend(documents)
        self.embeddings.extend(embeddings)

    def search(self, query: str, k: int = 3) -> List[Dict[str, Any]]:
        """Retrieve top-k documents by cosine similarity."""
        query_embedding = self.client.embed([query], model="voyage-2").embeddings[0]
        scores = [self._cosine_similarity(query_embedding, emb) for emb in self.embeddings]
        top_indices = np.argsort(scores)[-k:][::-1]
        return [self.documents[i] for i in top_indices]

    def _cosine_similarity(self, a: List[float], b: List[float]) -> float:
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
```
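The similarity search above can be exercised without any API calls. Here is a minimal sketch of the same top-k cosine retrieval over precomputed toy embeddings (the 3-dimensional vectors are stand-ins for real Voyage AI embeddings, which have far more dimensions):

```python
import numpy as np

def cosine_similarity(a, b):
    # Cosine similarity between two vectors
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def top_k(query_emb, doc_embs, k=2):
    # Score every document embedding against the query, return top-k indices
    scores = [cosine_similarity(query_emb, emb) for emb in doc_embs]
    return [int(i) for i in np.argsort(scores)[-k:][::-1]]

# Toy 3-dimensional "embeddings"
docs = [np.array([1.0, 0.0, 0.0]),
        np.array([0.0, 1.0, 0.0]),
        np.array([0.9, 0.1, 0.0])]
query = np.array([1.0, 0.0, 0.1])

print(top_k(query, docs, k=2))  # [0, 2] -- docs 0 and 2 point the same way as the query
```

This is exactly what `search` does internally once the embeddings exist; the only production difference is that the vectors come from `client.embed`.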
Level 1: Basic RAG (Naive RAG)
A basic RAG pipeline includes three steps:
- Chunk documents by heading, so each chunk contains only the content under a single subheading
- Embed each chunk using Voyage AI
- Retrieve relevant chunks using cosine similarity
```python
import anthropic

class BasicRAG:
    def __init__(self, anthropic_key: str, voyage_key: str):
        self.vector_db = InMemoryVectorDB(api_key=voyage_key)
        self.llm = anthropic.Anthropic(api_key=anthropic_key)

    def answer_query(self, query: str) -> str:
        # Retrieve relevant chunks
        chunks = self.vector_db.search(query, k=3)
        context = "\n\n".join([chunk["content"] for chunk in chunks])

        # Generate answer with Claude
        prompt = f"""Based on the following context, answer the user's question.
If the context doesn't contain enough information, say so.

Context:
{context}

Question: {query}

Answer:"""
        response = self.llm.messages.create(
            model="claude-3-5-sonnet-20241022",
            max_tokens=500,
            messages=[{"role": "user", "content": prompt}]
        )
        return response.content[0].text
```
Building an Evaluation System
When evaluating RAG applications, it's critical to evaluate the retrieval system and end-to-end system separately. We synthetically generated an evaluation dataset of 100 samples, each containing:
- A question
- Relevant chunks (ground truth)
- A correct answer
Key Metrics Explained
#### Retrieval Metrics
**Precision** measures the proportion of retrieved chunks that are actually relevant. High precision means fewer false positives.

$$\text{Precision} = \frac{|\text{Retrieved} \cap \text{Correct}|}{|\text{Retrieved}|}$$

**Recall** measures completeness: how many of the correct chunks were retrieved. High recall ensures the LLM has all the information it needs.

$$\text{Recall} = \frac{|\text{Retrieved} \cap \text{Correct}|}{|\text{Correct}|}$$
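As a quick worked example (toy chunk IDs, not from the actual eval set): suppose the retriever returns 3 chunks and the ground-truth set contains 4 relevant chunks, 2 of which were retrieved.

```python
retrieved = ["c1", "c2", "c3"]      # chunks returned by the retriever, in rank order
correct = {"c2", "c3", "c7", "c9"}  # ground-truth relevant chunks

true_positives = len(set(retrieved) & correct)  # c2 and c3 overlap -> 2
precision = true_positives / len(retrieved)     # 2 / 3
recall = true_positives / len(correct)          # 2 / 4

print(round(precision, 2), recall)  # 0.67 0.5
```

Note the tension: retrieving more chunks can only raise recall, but it tends to lower precision, which is why both are tracked.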
**F1 Score** is the harmonic mean of precision and recall. **Mean Reciprocal Rank (MRR)** evaluates how early the first relevant chunk appears in the results. This is crucial because Claude has limited context and may not process all retrieved chunks equally.

#### End-to-End Metric
**End-to-End Accuracy** measures whether Claude's final answer is correct given the retrieved context.

```python
import numpy as np

def evaluate_retrieval(rag_system, eval_dataset):
    """Evaluate retrieval performance."""
    precisions, recalls, f1s, mrrs = [], [], [], []
    for item in eval_dataset:
        query = item["question"]
        correct_chunks = set(item["relevant_chunks"])

        # Retrieve chunks
        retrieved = rag_system.vector_db.search(query, k=3)
        retrieved_ids = set([doc["id"] for doc in retrieved])

        # Calculate metrics
        true_positives = len(retrieved_ids & correct_chunks)
        precision = true_positives / len(retrieved) if retrieved else 0
        recall = true_positives / len(correct_chunks) if correct_chunks else 0
        f1 = 2 * (precision * recall) / (precision + recall) if (precision + recall) > 0 else 0

        # MRR: reciprocal rank of the first relevant chunk
        mrr = 0
        for rank, doc in enumerate(retrieved, 1):
            if doc["id"] in correct_chunks:
                mrr = 1 / rank
                break

        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
        mrrs.append(mrr)

    return {
        "avg_precision": np.mean(precisions),
        "avg_recall": np.mean(recalls),
        "avg_f1": np.mean(f1s),
        "avg_mrr": np.mean(mrrs)
    }
```
Level 2: Summary Indexing
Basic RAG struggles when relevant information is spread across multiple chunks. Summary indexing addresses this by creating summary-level embeddings that capture the essence of larger document sections.
```python
def create_summary_index(documents, llm_client):
    """Create summary embeddings for document sections."""
    summary_index = []
    for doc_section in documents:
        # Generate a summary of the section
        prompt = f"Summarize the following content in 2-3 sentences:\n\n{doc_section['content']}"
        response = llm_client.messages.create(
            model="claude-3-haiku-20240307",
            max_tokens=150,
            messages=[{"role": "user", "content": prompt}]
        )
        summary = response.content[0].text

        # Embed the summary instead of raw content
        summary_index.append({
            "summary": summary,
            "original_content": doc_section["content"],
            "id": doc_section["id"]
        })
    return summary_index
```
This technique improved recall by capturing broader context, allowing Claude to retrieve more relevant information even when the exact query terms don't appear in the target chunk.
Level 3: Summary Indexing + Re-Ranking
The final level combines summary indexing with re-ranking using Claude. After initial retrieval, Claude re-ranks the chunks based on relevance to the query.
```python
def rerank_with_claude(query, retrieved_chunks, llm_client):
    """Re-rank retrieved chunks using Claude."""
    # Prepare chunks for re-ranking
    chunks_text = ""
    for i, chunk in enumerate(retrieved_chunks):
        chunks_text += f"[{i+1}] {chunk['content'][:500]}...\n\n"

    prompt = f"""Given the query: "{query}"

Rank the following chunks from most relevant (1) to least relevant ({len(retrieved_chunks)}).
Return only the ranked list of numbers, separated by commas.

{chunks_text}
"""
    response = llm_client.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=50,
        messages=[{"role": "user", "content": prompt}]
    )

    # Parse the comma-separated ranking (assumes well-formed model output)
    ranking = [int(x.strip()) - 1 for x in response.content[0].text.split(",")]
    return [retrieved_chunks[i] for i in ranking]
```
Re-ranking dramatically improved MRR from 0.74 to 0.87, ensuring the most relevant context appears first in Claude's context window.
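The comma-separated ranking that Claude returns is the fragile step in this pipeline. One way to harden it (an illustrative helper, not part of the original code) is a parser that drops invalid or duplicate entries and falls back to the original order when the output is malformed:

```python
def parse_ranking(text: str, n: int) -> list:
    """Parse a comma-separated ranking like '3, 1, 2' into zero-based indices.

    Falls back to the original order if the model's output is malformed.
    """
    try:
        ranks = [int(tok.strip()) - 1 for tok in text.split(",")]
    except ValueError:
        return list(range(n))  # non-numeric output: keep original order
    # Keep only valid, unique indices, then append any the model omitted
    seen = []
    for r in ranks:
        if 0 <= r < n and r not in seen:
            seen.append(r)
    seen.extend(i for i in range(n) if i not in seen)
    return seen

print(parse_ranking("3, 1, 2", 3))   # [2, 0, 1]
print(parse_ranking("garbage", 3))   # [0, 1, 2]
```

Dropping in a guard like this keeps a single malformed completion from raising an `IndexError` mid-pipeline.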
Putting It All Together: The Optimized Pipeline
```python
class OptimizedRAG:
    def __init__(self, anthropic_key: str, voyage_key: str):
        self.vector_db = InMemoryVectorDB(api_key=voyage_key)
        self.llm = anthropic.Anthropic(api_key=anthropic_key)
        self.summary_index = None

    def initialize_with_summaries(self, documents):
        self.summary_index = create_summary_index(documents, self.llm)
        # Add summary embeddings to the vector DB
        for item in self.summary_index:
            self.vector_db.add_documents([{
                "id": item["id"],
                "content": item["summary"]
            }])

    def answer_query(self, query: str) -> str:
        # Retrieve using the summary index
        initial_chunks = self.vector_db.search(query, k=5)

        # Map back to original content
        original_chunks = []
        for chunk in initial_chunks:
            for item in self.summary_index:
                if item["id"] == chunk["id"]:
                    original_chunks.append({
                        "content": item["original_content"],
                        "id": item["id"]
                    })
                    break

        # Re-rank with Claude
        reranked = rerank_with_claude(query, original_chunks, self.llm)
        context = "\n\n".join([chunk["content"] for chunk in reranked[:3]])

        # Generate the final answer
        prompt = f"""Based on the following context, answer the user's question.

Context:
{context}

Question: {query}

Answer:"""
        response = self.llm.messages.create(
            model="claude-3-5-sonnet-20241022",
            max_tokens=500,
            messages=[{"role": "user", "content": prompt}]
        )
        return response.content[0].text
```
Performance Comparison
| Metric | Basic RAG | Summary Indexing | Summary + Re-Ranking |
|---|---|---|---|
| Avg Precision | 0.43 | 0.43 | 0.44 |
| Avg Recall | 0.66 | 0.68 | 0.69 |
| Avg F1 Score | 0.52 | 0.53 | 0.54 |
| Avg MRR | 0.74 | 0.80 | 0.87 |
| End-to-End Accuracy | 71% | 76% | 81% |
Best Practices for Production RAG
- Evaluate retrieval and generation separately – This helps you identify where the bottleneck is
- Use a diverse evaluation dataset – Include questions that require single-chunk, multi-chunk, and edge-case reasoning
- Monitor MRR closely – It directly impacts how well Claude can use the retrieved context
- Consider chunk overlap – Overlapping chunks can improve recall at the cost of more tokens
- Test with different embedding models – Voyage AI, OpenAI, and Cohere all offer strong options
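The chunk-overlap point above can be illustrated with a simple character-based chunker (a toy sketch; production systems typically split on headings or tokens, as this guide does):

```python
def chunk_with_overlap(text: str, chunk_size: int = 100, overlap: int = 20) -> list:
    """Split text into fixed-size chunks where consecutive chunks share `overlap` characters."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break
    return chunks

doc = "".join(chr(65 + i % 26) for i in range(250))  # 250 chars of dummy text
chunks = chunk_with_overlap(doc, chunk_size=100, overlap=20)
print([len(c) for c in chunks])  # [100, 100, 90]
```

Because each chunk repeats the tail of its predecessor, a sentence that straddles a boundary still appears whole in at least one chunk, which is where the recall gain comes from; the cost is that the repeated characters are embedded and retrieved twice.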
Key Takeaways
- Start simple, then optimize: A basic RAG pipeline works for many use cases. Add complexity (summary indexing, re-ranking) only when metrics show room for improvement.
- Measure what matters: Separate retrieval metrics (precision, recall, F1, MRR) from end-to-end accuracy. This pinpoints whether the issue is retrieval or generation.
- Re-ranking with Claude significantly improves MRR: From 0.74 to 0.87 in our tests, ensuring the most relevant context appears first in Claude's context window.
- Summary indexing boosts recall: By capturing broader document context, you retrieve more relevant information even when exact query terms are missing.
- Production RAG requires continuous evaluation: Your evaluation dataset should evolve with your use case, and metrics should be tracked over time to catch regressions.