Building Production-Grade RAG Systems with Claude: From Basic to Advanced
Learn to build, evaluate, and optimize Retrieval Augmented Generation (RAG) systems with Claude. Covers basic setup, evaluation metrics, summary indexing, and re-ranking techniques.
This guide walks you through building a RAG system with Claude, from a basic pipeline to advanced techniques like summary indexing and re-ranking. You'll learn to measure retrieval precision, recall, F1, MRR, and end-to-end accuracy, and see how targeted improvements boosted accuracy from 71% to 81%.
Claude excels at a wide range of tasks, but it may struggle with queries specific to your unique business context. This is where Retrieval Augmented Generation (RAG) becomes invaluable. RAG enables Claude to leverage your internal knowledge bases or customer support documents, significantly enhancing its ability to answer domain-specific questions.
Enterprises are increasingly building RAG applications to improve workflows in customer support, Q&A over internal company documents, financial & legal analysis, and much more. In this guide, we'll demonstrate how to build and optimize a RAG system using the Claude Documentation as our knowledge base.
What You'll Learn
By the end of this guide, you'll be able to:
- Set up a basic RAG system using an in-memory vector database and embeddings from Voyage AI
- Build a robust evaluation suite that measures retrieval and end-to-end performance independently
- Implement advanced techniques including summary indexing and re-ranking with Claude
Prerequisites
You'll need:
- Python 3.8+
- API keys from Anthropic and Voyage AI
- Basic familiarity with Python and vector databases
Required Libraries
pip install anthropic voyageai pandas numpy matplotlib scikit-learn
Level 1: Basic RAG Pipeline
Let's start with what's often called "Naive RAG" — a straightforward three-step pipeline:
- Chunk documents by heading, so that each chunk contains only the content under a single subheading (a minimal chunking sketch follows this list)
- Embed each chunk using Voyage AI's embedding model
- Retrieve relevant chunks using cosine similarity
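How you chunk depends on your source format. As a minimal sketch, assuming the documentation pages are plain markdown strings, a chunker can split on heading lines; the function name chunk_by_heading is our own illustration, not part of any library.
import re
from typing import List

def chunk_by_heading(markdown_text: str) -> List[str]:
    # Split immediately before each markdown heading line ("#", "##", ...),
    # so every chunk keeps its heading plus the content beneath it.
    parts = re.split(r"(?m)^(?=#{1,6} )", markdown_text)
    return [part.strip() for part in parts if part.strip()]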
Initialize a Vector DB Class
For this example, we'll use an in-memory vector database. For production, consider hosted solutions like Pinecone, Weaviate, or Chroma.
import voyageai
import numpy as np
from typing import List, Dict
class InMemoryVectorDB:
    def __init__(self, api_key: str):
        self.client = voyageai.Client(api_key=api_key)
        self.documents = []
        self.embeddings = []

    def add_documents(self, documents: List[str]):
        self.documents.extend(documents)
        embeddings = self.client.embed(
            documents,
            model="voyage-2"
        ).embeddings
        self.embeddings.extend(embeddings)

    def search(self, query: str, k: int = 3) -> List[str]:
        query_embedding = self.client.embed(
            [query],
            model="voyage-2"
        ).embeddings[0]
        # Cosine similarity (Voyage embeddings are unit-normalized, so a dot product suffices)
        similarities = np.dot(self.embeddings, query_embedding)
        top_indices = np.argsort(similarities)[-k:][::-1]
        return [self.documents[i] for i in top_indices]
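A quick usage sketch, assuming doc_chunks is the list of chunk strings produced by the chunking step (the query string is just an example):
db = InMemoryVectorDB(api_key="your-voyage-key")
db.add_documents(doc_chunks)
top_chunks = db.search("How do I set max_tokens in the Messages API?", k=3)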
Query Claude with Retrieved Context
from anthropic import Anthropic
claude = Anthropic(api_key="your-anthropic-key")
def ask_claude_with_context(query: str, context_chunks: List[str]):
    context = "\n\n---\n\n".join(context_chunks)
    prompt = f"""Answer the following question based on the provided context.

Context:
{context}

Question: {query}

Answer:"""
    response = claude.messages.create(
        model="claude-3-sonnet-20240229",
        max_tokens=500,
        messages=[{"role": "user", "content": prompt}]
    )
    return response.content[0].text
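Putting retrieval and generation together looks like this; the question is only an example:
question = "What is the maximum number of tokens Claude can generate?"
chunks = db.search(question, k=3)
answer = ask_claude_with_context(question, chunks)
print(answer)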
Building an Evaluation System
When evaluating RAG applications, it's critical to evaluate the retrieval system and end-to-end system separately. "Vibes-based" evaluation won't cut it for production.
Creating an Evaluation Dataset
We synthetically generated an evaluation dataset of 100 samples, each containing:
- A question
- Relevant chunks from our docs (the ground truth for retrieval)
- A correct answer (the ground truth for end-to-end)
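One convenient representation is a list of dictionaries, one per sample. The field names below are illustrative rather than a required schema:
eval_dataset = [
    {
        "question": "How do I pass a system prompt with the Messages API?",
        "relevant_chunks": ["<chunk text 1>", "<chunk text 2>"],
        "correct_answer": "Use the top-level system parameter on the request."
    },
    # ... 99 more samples
]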
Key Metrics
#### Retrieval Metrics
Precision — "Of the chunks we retrieved, how many were correct?"$$\text{Precision} = \frac{|\text{Retrieved} \cap \text{Correct}|}{|\text{Retrieved}|}$$
High precision means few false positives. Low precision suggests irrelevant chunks are being retrieved.
Recall — "Of all correct chunks, how many did we retrieve?"$$\text{Recall} = \frac{|\text{Retrieved} \cap \text{Correct}|}{|\text{Correct}|}$$
High recall means comprehensive coverage. Low recall suggests important chunks are being missed.
F1 Score — Harmonic mean of precision and recall$$\text{F1} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$
Mean Reciprocal Rank (MRR) — Measures how high the first relevant chunk appears in the results$$\text{MRR} = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \frac{1}{\text{rank}_i}$$
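As a quick worked example with invented numbers: suppose we retrieve 3 chunks, 2 of which appear in a ground-truth set of 4 relevant chunks, and the first relevant chunk sits at rank 2. Then Precision = 2/3 ≈ 0.67, Recall = 2/4 = 0.50, F1 = 2 × (0.67 × 0.50) / (0.67 + 0.50) ≈ 0.57, and this query contributes a reciprocal rank of 1/2 = 0.50 to the MRR.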
#### End-to-End Metric
Accuracy — Does Claude's final answer match the ground truth? This is typically evaluated by a separate LLM judge or human annotator.
Implementing the Evaluation
def evaluate_retrieval(db, eval_dataset):
    precisions, recalls, f1s, mrrs = [], [], [], []
    for item in eval_dataset:
        retrieved = db.search(item["question"], k=3)
        relevant = set(item["relevant_chunks"])
        retrieved_set = set(retrieved)
        # Precision: fraction of retrieved chunks that are relevant
        true_positives = len(retrieved_set & relevant)
        precision = true_positives / len(retrieved) if retrieved else 0
        # Recall: fraction of relevant chunks that were retrieved
        recall = true_positives / len(relevant) if relevant else 0
        # F1: harmonic mean of precision and recall
        f1 = 2 * (precision * recall) / (precision + recall) if (precision + recall) > 0 else 0
        # MRR: reciprocal rank of the first relevant chunk
        mrr = 0
        for i, chunk in enumerate(retrieved):
            if chunk in relevant:
                mrr = 1 / (i + 1)
                break
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
        mrrs.append(mrr)
    return {
        "avg_precision": np.mean(precisions),
        "avg_recall": np.mean(recalls),
        "avg_f1": np.mean(f1s),
        "avg_mrr": np.mean(mrrs)
    }
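The end-to-end accuracy check is not shown above; one straightforward way to implement it is to use Claude itself as a judge. The sketch below assumes each eval sample carries a correct_answer field as in the dataset example earlier, and the one-word correct/incorrect protocol is our own convention:
def evaluate_end_to_end(db, eval_dataset):
    correct = 0
    for item in eval_dataset:
        chunks = db.search(item["question"], k=3)
        answer = ask_claude_with_context(item["question"], chunks)
        judge_prompt = (
            f"Question: {item['question']}\n"
            f"Reference answer: {item['correct_answer']}\n"
            f"Candidate answer: {answer}\n\n"
            "Does the candidate answer convey the same information as the reference answer? "
            "Reply with exactly one word: correct or incorrect."
        )
        verdict = claude.messages.create(
            model="claude-3-haiku-20240307",
            max_tokens=5,
            messages=[{"role": "user", "content": judge_prompt}]
        ).content[0].text.strip().lower()
        if verdict.startswith("correct"):
            correct += 1
    return correct / len(eval_dataset)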
Level 2: Summary Indexing
A major limitation of basic RAG is that individual chunks may lack context. Summary indexing addresses this by creating a two-tier index:
- Summary chunks — A high-level summary of each section
- Detailed chunks — The original content
def generate_summary(chunks: List[str]) -> str:
    prompt = f"Summarize the following content in 2-3 sentences:\n\n{' '.join(chunks)}"
    response = claude.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=200,
        messages=[{"role": "user", "content": prompt}]
    )
    return response.content[0].text
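The code above covers only the summary generation step. One way to assemble the two-tier index is to embed each section's summary and keep a mapping from the summary back to that section's detailed chunks; the SummaryIndex class below is our own sketch rather than part of the original code:
class SummaryIndex:
    def __init__(self, api_key: str):
        self.summary_db = InMemoryVectorDB(api_key)  # tier 1: section summaries
        self.sections = []                           # tier 2: detailed chunks per section

    def add_section(self, chunks: List[str]):
        summary = generate_summary(chunks)
        self.summary_db.add_documents([summary])
        self.sections.append(chunks)

    def search(self, query: str, k: int = 1) -> List[str]:
        # Match the query against summaries, then return the detailed chunks
        # of the best-matching sections for generation.
        hits = self.summary_db.search(query, k=k)
        results = []
        for summary in hits:
            idx = self.summary_db.documents.index(summary)
            results.extend(self.sections[idx])
        return results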
Level 3: Summary Indexing + Re-Ranking
Re-ranking adds a second stage to retrieval. After the initial vector search, Claude re-ranks the top-k results to ensure the most relevant chunks are used for generation.
def rerank_with_claude(query: str, chunks: List[str], top_n: int = 3) -> List[str]:
    prompt = f"""Given the query below, rank the following chunks by relevance.
Return only the chunk indices in order of relevance, separated by commas.

Query: {query}

Chunks:
"""
    for i, chunk in enumerate(chunks):
        prompt += f"\n[{i}] {chunk[:200]}..."
    prompt += "\n\nRelevant indices (comma-separated):"
    response = claude.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=100,
        messages=[{"role": "user", "content": prompt}]
    )
    # Parse the returned indices, ignoring any non-numeric tokens the model may emit
    indices = [int(tok.strip()) for tok in response.content[0].text.split(",") if tok.strip().isdigit()]
    return [chunks[i] for i in indices[:top_n] if i < len(chunks)]
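In practice the re-ranker sits behind a wider first-stage search: retrieve more candidates than you need, then let Claude narrow them down. The k values here are illustrative:
question = "How do I stream responses from the Messages API?"
candidates = db.search(question, k=10)  # cast a wide net with vector search
best_chunks = rerank_with_claude(question, candidates, top_n=3)
answer = ask_claude_with_context(question, best_chunks)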
Results: Before and After
Through these targeted improvements, we achieved significant performance gains:
| Metric | Basic RAG | Advanced RAG |
|---|---|---|
| Avg Precision | 0.43 | 0.44 |
| Avg Recall | 0.66 | 0.69 |
| Avg F1 Score | 0.52 | 0.54 |
| Avg MRR | 0.74 | 0.87 |
| End-to-End Accuracy | 71% | 81% |
Production Considerations
- Rate Limits: Full evaluations can hit rate limits unless you're at Tier 2 or above. Consider sampling your dataset for quick iterations.
- Token Usage: Summary indexing and re-ranking increase token consumption. Balance cost against accuracy needs.
- Vector Database: For production, use a hosted vector database with built-in indexing and scaling.
- Evaluation Dataset: Invest time in creating a high-quality, representative evaluation dataset. It's the foundation of all improvements.
Key Takeaways
- Evaluate retrieval and generation separately — This lets you pinpoint where improvements are needed. A perfect retrieval system with poor generation (or vice versa) requires different fixes.
- Use structured metrics — Precision, recall, F1, and MRR give you objective measures of retrieval quality. Don't rely on "vibes."
- Summary indexing improves recall — By creating a two-tier index, you help Claude find the right context even when queries don't match exact chunk wording.
- Re-ranking boosts MRR significantly — A second pass with Claude to re-order results ensures the most relevant information appears first, improving final answer quality.
- Invest in your evaluation dataset — The 100-question synthetic dataset was the cornerstone of our optimization. Without it, we couldn't measure improvement.