
Building Production-Grade RAG Systems with Claude: From Basic to Advanced

Learn to build and optimize Retrieval Augmented Generation (RAG) systems with Claude. Covers basic setup, evaluation metrics, summary indexing, and re-ranking techniques.

Quick Answer

This guide walks you through building a RAG system with Claude, from a basic pipeline to advanced techniques like summary indexing and re-ranking. You'll learn to evaluate retrieval and end-to-end performance using precision, recall, F1, MRR, and accuracy metrics.

Tags: RAG, Claude, Evaluation, Vector Search, Prompt Engineering


Retrieval Augmented Generation (RAG) is one of the most powerful patterns for extending Claude's capabilities to your unique business context. Whether you're building a customer support bot, an internal knowledge base Q&A system, or a financial analysis tool, RAG lets Claude tap into your proprietary data to deliver accurate, context-aware answers.

In this guide, we'll walk through building and optimizing a RAG system using Claude and the Anthropic documentation as our knowledge base. You'll learn how to move from a basic "naive RAG" implementation to an advanced system that achieves measurable improvements in retrieval quality and end-to-end accuracy.

What You'll Learn

  • How to set up a basic RAG pipeline with Claude and Voyage AI embeddings
  • How to build a robust evaluation suite with production-grade metrics
  • How to implement summary indexing for better retrieval coverage
  • How to use Claude as a re-ranker to improve result relevance
  • How to measure and optimize precision, recall, F1, MRR, and end-to-end accuracy

Prerequisites

Before diving in, make sure you have:

  • An Anthropic API key (for Claude)
  • A Voyage AI API key (for embeddings)
  • A working Python 3 environment

Install the required libraries:
pip install anthropic voyageai pandas numpy matplotlib scikit-learn

Level 1: Basic RAG Pipeline

Let's start with a simple "naive RAG" implementation. This three-step process forms the foundation of any RAG system:

  • Chunk documents by heading (each subheading becomes a separate chunk; a sketch follows this list)
  • Embed each chunk using Voyage AI's embedding model
  • Retrieve relevant chunks using cosine similarity when a query comes in
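
The embedding and retrieval steps are covered by the vector store below; step 1, chunking by heading, might look like this minimal sketch. It assumes the source documents are Markdown, and chunk_by_heading and its regex are illustrative helpers rather than code from the original walkthrough.

import re
from typing import Dict, List

def chunk_by_heading(markdown_text: str, source: str) -> List[Dict[str, str]]:
    # Start a new chunk at every Markdown heading; the heading line stays in the chunk text
    chunks: List[Dict[str, str]] = []
    current: List[str] = []
    for line in markdown_text.splitlines():
        if re.match(r"^#{1,6}\s", line) and current:
            chunks.append({"content": "\n".join(current).strip(), "source": source})
            current = []
        current.append(line)
    if current:
        chunks.append({"content": "\n".join(current).strip(), "source": source})
    return [c for c in chunks if c["content"]]

Each resulting dict carries a content field, which is exactly what the vector store below expects.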

Setting Up the Vector Database

For this example, we'll use an in-memory vector store. In production, you'd likely use a hosted solution like Pinecone, Weaviate, or Chroma.

import voyageai
import numpy as np
from typing import List, Dict

class InMemoryVectorDB:
    def __init__(self, api_key: str):
        self.client = voyageai.Client(api_key=api_key)
        self.documents = []
        self.embeddings = []

    def add_documents(self, documents: List[Dict[str, str]]):
        # Embed every chunk in one batch and keep documents and vectors aligned
        texts = [doc["content"] for doc in documents]
        response = self.client.embed(texts, model="voyage-2")
        self.embeddings.extend(response.embeddings)
        self.documents.extend(documents)

    def search(self, query: str, k: int = 3) -> List[Dict]:
        # Embed the query, score every stored chunk, and return the top-k matches
        query_embedding = self.client.embed([query], model="voyage-2").embeddings[0]
        scores = [np.dot(query_embedding, doc_emb) for doc_emb in self.embeddings]
        top_indices = np.argsort(scores)[-k:][::-1]
        return [self.documents[i] for i in top_indices]

Implementing the Basic RAG Query

from anthropic import Anthropic

class BasicRAG:
    def __init__(self, vector_db, anthropic_api_key: str):
        self.vector_db = vector_db
        self.anthropic = Anthropic(api_key=anthropic_api_key)

    def query(self, question: str) -> str:
        # Step 1: Retrieve relevant chunks
        chunks = self.vector_db.search(question, k=3)

        # Step 2: Build context from chunks
        context = "\n\n".join([chunk["content"] for chunk in chunks])

        # Step 3: Generate answer with Claude
        response = self.anthropic.messages.create(
            model="claude-3-sonnet-20240229",
            max_tokens=1024,
            system="You are a helpful assistant. Answer the question based on the provided context.",
            messages=[
                {"role": "user", "content": f"Context: {context}\n\nQuestion: {question}"}
            ]
        )
        return response.content[0].text

Building an Evaluation System

"Vibes-based" evaluation won't cut it in production. You need objective metrics to measure and improve your RAG system. We'll evaluate two independent components:

  • Retrieval performance: How well does the system find relevant chunks?
  • End-to-end performance: How accurate are the final answers?

Creating a Synthetic Evaluation Dataset

Generate 100+ test samples, each containing:

  • A question
  • The correct answer
  • The relevant document chunks that should be retrieved
import json

# Example evaluation sample
{
    "question": "What is the maximum context window for Claude 3 Opus?",
    "correct_answer": "Claude 3 Opus supports up to 200,000 tokens of context.",
    "relevant_chunks": [
        "Claude 3 Opus features a 200,000 token context window...",
        "The context window allows processing large documents..."
    ]
}

Key Retrieval Metrics

Precision

Precision measures how many of the retrieved chunks are actually relevant. High precision means fewer false positives.

Precision = |Retrieved ∩ Correct| / |Retrieved|

Recall

Recall measures how many of the relevant chunks were retrieved. High recall means you're not missing important information.

Recall = |Retrieved ∩ Correct| / |Correct|

F1 Score

The F1 score is the harmonic mean of precision and recall, giving a balanced view of retrieval quality.

F1 = 2 × (Precision × Recall) / (Precision + Recall)
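
With each evaluation sample listing its correct chunks, the three formulas above translate directly into code. A small sketch (the retrieval_scores name and set-based matching are illustrative):

def retrieval_scores(retrieved_chunks: List[str], correct_chunks: List[str]) -> Dict[str, float]:
    # Compare retrieved and correct chunks as sets so duplicates don't inflate the counts
    retrieved, correct = set(retrieved_chunks), set(correct_chunks)
    overlap = len(retrieved & correct)
    precision = overlap / len(retrieved) if retrieved else 0.0
    recall = overlap / len(correct) if correct else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}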

Mean Reciprocal Rank (MRR)

MRR evaluates how early the first relevant chunk appears in your results. A high MRR means users see relevant information quickly.

def calculate_mrr(retrieved_chunks, correct_chunks):
    for i, chunk in enumerate(retrieved_chunks):
        if chunk in correct_chunks:
            return 1.0 / (i + 1)
    return 0.0

End-to-End Accuracy

This measures whether Claude's final answer is correct. You can use LLM-as-judge or manual evaluation.

def evaluate_end_to_end(rag_system, eval_dataset):
    correct = 0
    for sample in eval_dataset:
        answer = rag_system.query(sample["question"])
        # Use Claude to judge correctness
        judgment = judge_answer(answer, sample["correct_answer"])
        if judgment == "correct":
            correct += 1
    return correct / len(eval_dataset)
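
judge_answer is left undefined above; an LLM-as-judge version might look like the sketch below. The prompt wording, the module-level client, and the choice of Haiku as the judge are assumptions, not part of the original walkthrough.

client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def judge_answer(answer: str, correct_answer: str) -> str:
    # Ask Claude whether the generated answer matches the reference answer
    prompt = (
        f"Reference answer:\n{correct_answer}\n\n"
        f"Candidate answer:\n{answer}\n\n"
        "Does the candidate answer convey the same facts as the reference? "
        "Reply with exactly one word: correct or incorrect."
    )
    response = client.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=10,
        messages=[{"role": "user", "content": prompt}],
    )
    verdict = response.content[0].text.strip().lower()
    return "correct" if verdict.startswith("correct") else "incorrect"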

Level 2: Summary Indexing

Basic chunking often misses the forest for the trees. Summary indexing adds a high-level overview of each document section to improve retrieval.

def create_summary_index(documents):
    summary_db = InMemoryVectorDB(api_key=VOYAGE_API_KEY)
    
    for doc in documents:
        # Generate a summary using Claude
        summary = generate_summary(doc["content"])
        
        # Store both the summary and original content
        summary_db.add_documents([
            {"content": summary, "type": "summary", "original": doc},
            {"content": doc["content"], "type": "full", "original": doc}
        ])
    
    return summary_db

When a query comes in, search both summaries and full chunks. This improves recall by helping the system find relevant documents even when the query doesn't match specific chunk text.
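
generate_summary is also left to the reader in the snippet above. One plausible version, plus a search helper that maps summary hits back to their original documents, could look like this sketch; the prompt wording and the summary_search name are assumptions, and it reuses the module-level client defined earlier.

def generate_summary(content: str) -> str:
    # Ask Claude for a short, retrieval-friendly summary of one document section
    response = client.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=200,
        messages=[{
            "role": "user",
            "content": "Summarize this documentation section in 2-3 sentences, "
                       "keeping the key terms a user might search for:\n\n" + content,
        }],
    )
    return response.content[0].text

def summary_search(summary_db: InMemoryVectorDB, query: str, k: int = 3) -> List[Dict]:
    # Search summaries and full chunks together, then return the original documents they point to
    hits = summary_db.search(query, k=k)
    return [hit["original"] for hit in hits]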

Level 3: Re-Ranking with Claude

Re-ranking takes the top-k results from your initial retrieval and uses Claude to reorder them by relevance. This dramatically improves MRR.

def rerank_with_claude(query: str, candidates: List[Dict]) -> List[Dict]:
    prompt = f"""
    Given the query: "{query}"

    Rank the following passages by relevance (most relevant first):

    {chr(10).join([f"{i+1}. {c['content']}" for i, c in enumerate(candidates)])}

    Return the indices in order of relevance, separated by commas.
    """

    # Reuse the Anthropic client created earlier; Haiku keeps re-ranking fast and cheap
    response = client.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=100,
        messages=[{"role": "user", "content": prompt}]
    )

    # Parse the comma-separated indices (1-based in the reply) and reorder the candidates
    indices = [int(i.strip()) - 1 for i in response.content[0].text.split(",")]
    return [candidates[i] for i in indices]
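
To plug re-ranking into the pipeline, over-retrieve a wider candidate set, let Claude reorder it, and answer from the best few chunks. A usage sketch (the k values and the advanced_query name are illustrative):

def advanced_query(rag: BasicRAG, question: str) -> str:
    # Over-retrieve, re-rank with Claude, then answer from the top three chunks
    candidates = rag.vector_db.search(question, k=10)
    reranked = rerank_with_claude(question, candidates)
    context = "\n\n".join(chunk["content"] for chunk in reranked[:3])
    response = rag.anthropic.messages.create(
        model="claude-3-sonnet-20240229",
        max_tokens=1024,
        system="You are a helpful assistant. Answer the question based on the provided context.",
        messages=[{"role": "user", "content": f"Context: {context}\n\nQuestion: {question}"}],
    )
    return response.content[0].text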

Results: Before and After

After implementing summary indexing and re-ranking, here's the improvement over the basic RAG pipeline:

Metric                  Basic RAG    Advanced RAG
Avg Precision           0.43         0.44
Avg Recall              0.66         0.69
Avg F1 Score            0.52         0.54
Avg MRR                 0.74         0.87
End-to-End Accuracy     71%          81%

The most dramatic improvement is in MRR (0.74 → 0.87), thanks to Claude's re-ranking capability. End-to-end accuracy jumped from 71% to 81%, meaning users get correct answers more often.

Production Considerations

  • Rate limits: Full evaluations can hit API rate limits. Consider running smaller eval sets or using Tier 2+ accounts.
  • Cost management: Summary indexing and re-ranking add token costs. Balance improvements against budget.
  • Vector database: For production, use a hosted vector DB with proper indexing and scaling.
  • Evaluation dataset: Maintain a diverse, evolving eval set that reflects real user queries.

Key Takeaways

  • Evaluate retrieval and generation separately to identify where your RAG system needs improvement. Use precision, recall, F1, and MRR for retrieval; accuracy for end-to-end.
  • Summary indexing boosts recall by adding high-level document overviews that catch queries missed by chunk-level search.
  • Re-ranking with Claude dramatically improves MRR, ensuring the most relevant information appears first in your results.
  • Start simple, then iterate — a basic RAG pipeline can be surprisingly effective, and targeted improvements (like re-ranking) often yield the biggest gains.
  • Build a synthetic evaluation dataset early in your development process. It's the foundation for objective, reproducible improvements.