
Building Production-Grade RAG Systems with Claude: From Basic to Advanced

Learn to build and optimize Retrieval Augmented Generation (RAG) systems with Claude. Covers basic setup, evaluation metrics, summary indexing, and re-ranking techniques.

Quick Answer

This guide walks you through building a RAG system with Claude, from a basic pipeline to advanced techniques like summary indexing and re-ranking. You'll learn to evaluate retrieval and end-to-end performance using precision, recall, F1, MRR, and accuracy metrics.

Tags: RAG, Claude, Evaluation, Vector Search, Prompt Engineering


Retrieval Augmented Generation (RAG) is one of the most powerful patterns for extending Claude's capabilities to your unique business context. Whether you're building a customer support bot, an internal knowledge base Q&A system, or a financial analysis tool, RAG lets Claude tap into your proprietary data to deliver accurate, context-aware answers.

In this guide, we'll walk through building and optimizing a RAG system using Claude and the Anthropic documentation as our knowledge base. You'll learn how to move from a basic "naive RAG" implementation to an advanced system that achieves measurable improvements in retrieval quality and end-to-end accuracy.

What You'll Learn

  • How to set up a basic RAG pipeline with Claude and Voyage AI embeddings
  • How to build a robust evaluation suite with production-grade metrics
  • How to implement summary indexing for better retrieval coverage
  • How to use Claude as a re-ranker to improve result relevance
  • How to measure and optimize precision, recall, F1, MRR, and end-to-end accuracy

Prerequisites

Before diving in, make sure you have:

  • An Anthropic API key (for Claude)
  • A Voyage AI API key (for embeddings)
  • A working Python 3 environment

Install the required libraries:
pip install anthropic voyageai pandas numpy matplotlib scikit-learn

Level 1: Basic RAG Pipeline

Let's start with a simple "naive RAG" implementation. This three-step process forms the foundation of any RAG system:

  • Chunk documents by heading (each subheading becomes a separate chunk; a sketch follows this list)
  • Embed each chunk using Voyage AI's embedding model
  • Retrieve relevant chunks using cosine similarity when a query comes in
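
The embedding and retrieval steps are covered by the vector store below; step 1, chunking by heading, might look like this minimal sketch. It assumes the source documents are Markdown, and chunk_by_heading and its regex are illustrative helpers rather than code from the original walkthrough.

import re
from typing import Dict, List

def chunk_by_heading(markdown_text: str, source: str) -> List[Dict[str, str]]:
    # Start a new chunk at every Markdown heading; the heading line stays in the chunk text
    chunks: List[Dict[str, str]] = []
    current: List[str] = []
    for line in markdown_text.splitlines():
        if re.match(r"^#{1,6}\s", line) and current:
            chunks.append({"content": "\n".join(current).strip(), "source": source})
            current = []
        current.append(line)
    if current:
        chunks.append({"content": "\n".join(current).strip(), "source": source})
    return [c for c in chunks if c["content"]]

Each resulting dict carries a content field, which is exactly what the vector store below expects.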

Setting Up the Vector Database

For this example, we'll use an in-memory vector store. In production, you'd likely use a hosted solution like Pinecone, Weaviate, or Chroma.

import voyageai
import numpy as np
from typing import List, Dict

class InMemoryVectorDB:
    def __init__(self, api_key: str):
        self.client = voyageai.Client(api_key=api_key)
        self.documents = []
        self.embeddings = []

    def add_documents(self, documents: List[Dict[str, str]]):
        # Embed every chunk in one batch and keep documents and vectors aligned
        texts = [doc["content"] for doc in documents]
        response = self.client.embed(texts, model="voyage-2")
        self.embeddings.extend(response.embeddings)
        self.documents.extend(documents)

    def search(self, query: str, k: int = 3) -> List[Dict]:
        # Embed the query, score every stored chunk, and return the top-k matches
        query_embedding = self.client.embed([query], model="voyage-2").embeddings[0]
        scores = [np.dot(query_embedding, doc_emb) for doc_emb in self.embeddings]
        top_indices = np.argsort(scores)[-k:][::-1]
        return [self.documents[i] for i in top_indices]

Implementing the Basic RAG Query

from anthropic import Anthropic

class BasicRAG:
    def __init__(self, vector_db, anthropic_api_key: str):
        self.vector_db = vector_db
        self.anthropic = Anthropic(api_key=anthropic_api_key)

    def query(self, question: str) -> str:
        # Step 1: Retrieve relevant chunks
        chunks = self.vector_db.search(question, k=3)

        # Step 2: Build context from chunks
        context = "\n\n".join([chunk["content"] for chunk in chunks])

        # Step 3: Generate answer with Claude
        response = self.anthropic.messages.create(
            model="claude-3-sonnet-20240229",
            max_tokens=1024,
            system="You are a helpful assistant. Answer the question based on the provided context.",
            messages=[
                {"role": "user", "content": f"Context: {context}\n\nQuestion: {question}"}
            ]
        )
        return response.content[0].text

Building an Evaluation System

"Vibes-based" evaluation won't cut it in production. You need objective metrics to measure and improve your RAG system. We'll evaluate two independent components:

  • Retrieval performance: How well does the system find relevant chunks?
  • End-to-end performance: How accurate are the final answers?

Creating a Synthetic Evaluation Dataset

Generate 100+ test samples, each containing:

  • A question
  • The correct answer
  • The relevant document chunks that should be retrieved
import json

# Example evaluation sample
{
    "question": "What is the maximum context window for Claude 3 Opus?",
    "correct_answer": "Claude 3 Opus supports up to 200,000 tokens of context.",
    "relevant_chunks": [
        "Claude 3 Opus features a 200,000 token context window...",
        "The context window allows processing large documents..."
    ]
}

Key Retrieval Metrics

Precision

Precision measures how many of the retrieved chunks are actually relevant. High precision means fewer false positives.

Precision = |Retrieved ∩ Correct| / |Retrieved|

Recall

Recall measures how many of the relevant chunks were retrieved. High recall means you're not missing important information.

Recall = |Retrieved ∩ Correct| / |Correct|

F1 Score

The F1 score is the harmonic mean of precision and recall, giving a balanced view of retrieval quality.

F1 = 2 × (Precision × Recall) / (Precision + Recall)
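
With each evaluation sample listing its correct chunks, the three formulas above translate directly into code. A small sketch (the retrieval_scores name and set-based matching are illustrative):

def retrieval_scores(retrieved_chunks: List[str], correct_chunks: List[str]) -> Dict[str, float]:
    # Compare retrieved and correct chunks as sets so duplicates don't inflate the counts
    retrieved, correct = set(retrieved_chunks), set(correct_chunks)
    overlap = len(retrieved & correct)
    precision = overlap / len(retrieved) if retrieved else 0.0
    recall = overlap / len(correct) if correct else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}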

Mean Reciprocal Rank (MRR)

MRR evaluates how early the first relevant chunk appears in your results. A high MRR means users see relevant information quickly.

def calculate_mrr(retrieved_chunks, correct_chunks):
    for i, chunk in enumerate(retrieved_chunks):
        if chunk in correct_chunks:
            return 1.0 / (i + 1)
    return 0.0

End-to-End Accuracy

This measures whether Claude's final answer is correct. You can use LLM-as-judge or manual evaluation.

def evaluate_end_to_end(rag_system, eval_dataset):
    correct = 0
    for sample in eval_dataset:
        answer = rag_system.query(sample["question"])
        # Use Claude to judge correctness
        judgment = judge_answer(answer, sample["correct_answer"])
        if judgment == "correct":
            correct += 1
    return correct / len(eval_dataset)
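
judge_answer is left undefined above; an LLM-as-judge version might look like the sketch below. The prompt wording, the module-level client, and the choice of Haiku as the judge are assumptions, not part of the original walkthrough.

client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def judge_answer(answer: str, correct_answer: str) -> str:
    # Ask Claude whether the generated answer matches the reference answer
    prompt = (
        f"Reference answer:\n{correct_answer}\n\n"
        f"Candidate answer:\n{answer}\n\n"
        "Does the candidate answer convey the same facts as the reference? "
        "Reply with exactly one word: correct or incorrect."
    )
    response = client.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=10,
        messages=[{"role": "user", "content": prompt}],
    )
    verdict = response.content[0].text.strip().lower()
    return "correct" if verdict.startswith("correct") else "incorrect"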

Level 2: Summary Indexing

Basic chunking often misses the forest for the trees. Summary indexing adds a high-level overview of each document section to improve retrieval.

def create_summary_index(documents):
    summary_db = InMemoryVectorDB(api_key=VOYAGE_API_KEY)
    
    for doc in documents:
        # Generate a summary using Claude
        summary = generate_summary(doc["content"])
        
        # Store both the summary and original content
        summary_db.add_documents([
            {"content": summary, "type": "summary", "original": doc},
            {"content": doc["content"], "type": "full", "original": doc}
        ])
    
    return summary_db

When a query comes in, search both summaries and full chunks. This improves recall by helping the system find relevant documents even when the query doesn't match specific chunk text.
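
generate_summary is also left to the reader in the snippet above. One plausible version, plus a search helper that maps summary hits back to their original documents, could look like this sketch; the prompt wording and the summary_search name are assumptions, and it reuses the module-level client defined earlier.

def generate_summary(content: str) -> str:
    # Ask Claude for a short, retrieval-friendly summary of one document section
    response = client.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=200,
        messages=[{
            "role": "user",
            "content": "Summarize this documentation section in 2-3 sentences, "
                       "keeping the key terms a user might search for:\n\n" + content,
        }],
    )
    return response.content[0].text

def summary_search(summary_db: InMemoryVectorDB, query: str, k: int = 3) -> List[Dict]:
    # Search summaries and full chunks together, then return the original documents they point to
    hits = summary_db.search(query, k=k)
    return [hit["original"] for hit in hits]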

Level 3: Re-Ranking with Claude

Re-ranking takes the top-k results from your initial retrieval and uses Claude to reorder them by relevance. This dramatically improves MRR.

def rerank_with_claude(query: str, candidates: List[Dict]) -> List[Dict]:
    prompt = f"""
    Given the query: "{query}"

    Rank the following passages by relevance (most relevant first):

    {chr(10).join([f"{i+1}. {c['content']}" for i, c in enumerate(candidates)])}

    Return the indices in order of relevance, separated by commas.
    """

    # Reuse the Anthropic client created earlier; Haiku keeps re-ranking fast and cheap
    response = client.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=100,
        messages=[{"role": "user", "content": prompt}]
    )

    # Parse the comma-separated indices (1-based in the reply) and reorder the candidates
    indices = [int(i.strip()) - 1 for i in response.content[0].text.split(",")]
    return [candidates[i] for i in indices]
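
To plug re-ranking into the pipeline, over-retrieve a wider candidate set, let Claude reorder it, and answer from the best few chunks. A usage sketch (the k values and the advanced_query name are illustrative):

def advanced_query(rag: BasicRAG, question: str) -> str:
    # Over-retrieve, re-rank with Claude, then answer from the top three chunks
    candidates = rag.vector_db.search(question, k=10)
    reranked = rerank_with_claude(question, candidates)
    context = "\n\n".join(chunk["content"] for chunk in reranked[:3])
    response = rag.anthropic.messages.create(
        model="claude-3-sonnet-20240229",
        max_tokens=1024,
        system="You are a helpful assistant. Answer the question based on the provided context.",
        messages=[{"role": "user", "content": f"Context: {context}\n\nQuestion: {question}"}],
    )
    return response.content[0].text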

Results: Before and After

After implementing summary indexing and re-ranking, here's the improvement over the basic RAG pipeline:

Metric                  Basic RAG    Advanced RAG
Avg Precision           0.43         0.44
Avg Recall              0.66         0.69
Avg F1 Score            0.52         0.54
Avg MRR                 0.74         0.87
End-to-End Accuracy     71%          81%

The most dramatic improvement is in MRR (0.74 → 0.87), thanks to Claude's re-ranking capability. End-to-end accuracy jumped from 71% to 81%, meaning users get correct answers more often.

Production Considerations

  • Rate limits: Full evaluations can hit API rate limits. Consider running smaller eval sets or using Tier 2+ accounts.
  • Cost management: Summary indexing and re-ranking add token costs. Balance improvements against budget.
  • Vector database: For production, use a hosted vector DB with proper indexing and scaling.
  • Evaluation dataset: Maintain a diverse, evolving eval set that reflects real user queries.

Key Takeaways

  • Evaluate retrieval and generation separately to identify where your RAG system needs improvement. Use precision, recall, F1, and MRR for retrieval; accuracy for end-to-end.
  • Summary indexing boosts recall by adding high-level document overviews that catch queries missed by chunk-level search.
  • Re-ranking with Claude dramatically improves MRR, ensuring the most relevant information appears first in your results.
  • Start simple, then iterate — a basic RAG pipeline can be surprisingly effective, and targeted improvements (like re-ranking) often yield the biggest gains.
  • Build a synthetic evaluation dataset early in your development process. It's the foundation for objective, reproducible improvements.