Building a Production-Ready RAG System with Claude: From Basic to Advanced
Learn to build, evaluate, and optimize a Retrieval Augmented Generation (RAG) system with Claude. Covers basic setup, evaluation metrics, summary indexing, and re-ranking techniques.
This guide walks you through building a RAG system with Claude, covering basic setup, evaluation metrics (precision, recall, F1, MRR), and advanced techniques like summary indexing and re-ranking to boost end-to-end accuracy from 71% to 81%.
Introduction
Claude excels at a wide range of tasks, but it may struggle with queries specific to your unique business context. This is where Retrieval Augmented Generation (RAG) becomes invaluable. RAG enables Claude to leverage your internal knowledge bases or customer support documents, significantly enhancing its ability to answer domain-specific questions.
Enterprises are increasingly building RAG applications to improve workflows in customer support, Q&A over internal company documents, financial & legal analysis, and much more.
In this guide, we'll demonstrate how to build and optimize a RAG system using the Claude Documentation as our knowledge base. We'll walk you through:
- Setting up a basic RAG system using an in-memory vector database and embeddings from Voyage AI
- Building a robust evaluation suite with proper metrics
- Implementing advanced techniques like summary indexing and re-ranking with Claude
Prerequisites and Setup
Before diving in, you'll need:
- Python 3.8+
- API keys from Anthropic and Voyage AI
- Required libraries: `anthropic`, `voyageai`, `pandas`, `numpy`, `matplotlib`, `scikit-learn`
Initialize a Vector DB Class
For this guide, we'll use an in-memory vector database. In production, you'd likely use a hosted solution like Pinecone, Weaviate, or Chroma.
```python
import numpy as np
from typing import List, Dict, Any

class InMemoryVectorDB:
    def __init__(self):
        self.documents = []
        self.embeddings = []

    def add_documents(self, docs: List[str], embeddings: List[List[float]]):
        self.documents.extend(docs)
        self.embeddings.extend(embeddings)

    def search(self, query_embedding: List[float], top_k: int = 3) -> List[Dict[str, Any]]:
        # Score every stored embedding against the query, then keep the top_k highest
        scores = [self._cosine_similarity(query_embedding, emb) for emb in self.embeddings]
        top_indices = np.argsort(scores)[-top_k:][::-1]
        return [
            {"document": self.documents[i], "score": scores[i]}
            for i in top_indices
        ]

    def _cosine_similarity(self, a: List[float], b: List[float]) -> float:
        a, b = np.array(a), np.array(b)
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
```
Level 1: Basic RAG (Naive RAG)
A basic RAG pipeline includes three steps:
- Chunk documents by heading, so each chunk contains only the content under a single subheading (a minimal chunking sketch follows this list)
- Embed each chunk using Voyage AI
- Retrieve relevant chunks using cosine similarity
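The chunking step isn't shown in the pipeline code below, so here is a minimal sketch. The `chunk_by_heading` name and the markdown heading pattern are illustrative assumptions, not the exact chunking used for the Claude Documentation:

```python
import re
from typing import List

def chunk_by_heading(markdown_text: str) -> List[str]:
    """Illustrative helper: split a markdown document on headings so each chunk
    holds one section. Adjust the heading pattern to match your own docs."""
    # Split wherever a markdown heading (#, ##, ###, ...) starts a new line
    sections = re.split(r"\n(?=#{1,6} )", markdown_text)
    # Drop empty fragments and surrounding whitespace
    return [section.strip() for section in sections if section.strip()]
```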
```python
import anthropic
import voyageai
from typing import List

vo = voyageai.Client(api_key="your-voyage-api-key")

def embed_chunks(chunks: List[str]) -> List[List[float]]:
    result = vo.embed(chunks, model="voyage-2")
    return result.embeddings

def retrieve(query: str, vector_db: InMemoryVectorDB, top_k: int = 3) -> List[str]:
    query_embedding = vo.embed([query], model="voyage-2").embeddings[0]
    results = vector_db.search(query_embedding, top_k=top_k)
    return [r["document"] for r in results]

def answer_with_claude(query: str, context: List[str]) -> str:
    client = anthropic.Anthropic(api_key="your-anthropic-api-key")
    prompt = f"""Answer the question based on the following context:

Context:
{' '.join(context)}

Question: {query}

Answer:"""
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=500,
        messages=[{"role": "user", "content": prompt}]
    )
    return response.content[0].text
```
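Putting the pieces together, a minimal run of the basic pipeline might look like this; the sample chunks and question are placeholders, not part of the actual knowledge base:

```python
# Illustrative wiring of the pipeline above; replace these chunks with your own documents.
chunks = [
    "# Prompt engineering\nUse clear, direct instructions...",
    "# Tool use\nClaude can call external tools you define...",
]

db = InMemoryVectorDB()
db.add_documents(chunks, embed_chunks(chunks))

question = "How do I get Claude to use tools?"
context = retrieve(question, db, top_k=3)
print(answer_with_claude(question, context))
```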
Building an Evaluation System
When evaluating RAG applications, it's critical to evaluate the retrieval system and end-to-end system separately. We'll use a synthetically generated evaluation dataset of 100 samples, each containing:
- A question
- Relevant chunks (ground truth)
- A correct answer
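For reference, a single evaluation record might be structured like this; the field names and contents are illustrative, not the exact schema of the synthetic dataset:

```python
# A hypothetical evaluation record; your dataset's fields may differ.
eval_sample = {
    "question": "Which Claude 3 models support image inputs?",
    "correct_chunks": [
        "# Vision\nClaude 3 models accept images as input...",
    ],
    "correct_answer": "All Claude 3 models (Haiku, Sonnet, and Opus) accept image inputs.",
}
```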
Key Metrics
#### Retrieval Metrics
Precision measures the proportion of retrieved chunks that are actually relevant:

$$\text{Precision} = \frac{|\text{Retrieved} \cap \text{Correct}|}{|\text{Retrieved}|}$$

Recall measures completeness — how many of the correct chunks were retrieved:

$$\text{Recall} = \frac{|\text{Retrieved} \cap \text{Correct}|}{|\text{Correct}|}$$

F1 Score is the harmonic mean of precision and recall:

$$\text{F1} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$

Mean Reciprocal Rank (MRR) measures how early the first relevant chunk appears in the results:

$$\text{MRR} = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \frac{1}{\text{rank}_i}$$
#### End-to-End Metric
End-to-End Accuracy measures whether Claude's final answer is correct based on the retrieved context.

Implementing the Evaluation
```python
from typing import Dict, List

def evaluate_retrieval(retrieved_chunks: List[str], correct_chunks: List[str]) -> Dict[str, float]:
    retrieved_set = set(retrieved_chunks)
    correct_set = set(correct_chunks)
    true_positives = len(retrieved_set & correct_set)
    precision = true_positives / len(retrieved_set) if retrieved_set else 0
    recall = true_positives / len(correct_set) if correct_set else 0
    f1 = 2 * (precision * recall) / (precision + recall) if (precision + recall) > 0 else 0
    # MRR: reciprocal rank of the first relevant chunk (0 if none was retrieved)
    mrr = 0
    for i, chunk in enumerate(retrieved_chunks):
        if chunk in correct_set:
            mrr = 1.0 / (i + 1)
            break
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "mrr": mrr
    }
```
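The function above scores retrieval only. End-to-end accuracy requires judging Claude's final answer against the ground-truth answer; one common approach, sketched here as an assumption rather than this guide's exact grader, is to use Claude itself as a yes/no judge:

```python
import anthropic

def is_answer_correct(question: str, generated_answer: str, correct_answer: str) -> bool:
    """Hypothetical LLM-as-judge grader; the prompt wording and model choice are assumptions."""
    client = anthropic.Anthropic()
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=5,
        messages=[{
            "role": "user",
            "content": (
                f"Question: {question}\n\n"
                f"Reference answer: {correct_answer}\n\n"
                f"Candidate answer: {generated_answer}\n\n"
                "Does the candidate answer convey the same key facts as the reference? "
                "Reply with only 'yes' or 'no'."
            )
        }]
    )
    return response.content[0].text.strip().lower().startswith("yes")
```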
Level 2: Summary Indexing
One limitation of basic RAG is that individual chunks may lack context. Summary indexing addresses this by creating a summary of each document section and using it as an additional retrieval target.
```python
import anthropic
from typing import List

def generate_summary(chunk: str, client: anthropic.Anthropic) -> str:
    response = client.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=200,
        messages=[{
            "role": "user",
            "content": f"Summarize this text in 1-2 sentences:\n\n{chunk}"
        }]
    )
    return response.content[0].text

def build_summary_index(chunks: List[str]) -> List[str]:
    client = anthropic.Anthropic()
    summaries = [generate_summary(chunk, client) for chunk in chunks]
    # Store both original chunks and summaries in the vector DB
    return chunks + summaries
```
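Note that `build_summary_index` returns summaries alongside the original chunks but doesn't record which chunk each summary came from. At query time you typically want a hit on a summary to resolve back to its source chunk before prompting Claude; one way to keep that mapping, shown here as a sketch rather than the guide's exact implementation, is to return a lookup table alongside the retrieval targets:

```python
import anthropic
from typing import Dict, List, Tuple

def build_summary_index_with_mapping(chunks: List[str]) -> Tuple[List[str], Dict[str, str]]:
    """Illustrative variant: return retrieval targets plus a map back to source chunks."""
    client = anthropic.Anthropic()
    summaries = [generate_summary(chunk, client) for chunk in chunks]
    # Map every retrieval target (chunk or summary) to the chunk it represents,
    # so a hit on a summary can be resolved to the full original text.
    target_to_chunk = {chunk: chunk for chunk in chunks}
    target_to_chunk.update({summary: chunk for summary, chunk in zip(summaries, chunks)})
    return chunks + summaries, target_to_chunk
```

At answer time, pass each retrieved document through `target_to_chunk` so Claude always sees the full section rather than its summary.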
Level 3: Summary Indexing + Re-Ranking
Re-ranking uses Claude to score retrieved chunks by relevance before passing them to the final answer generation step. This significantly improves MRR and end-to-end accuracy.
```python
import anthropic
from typing import List

def rerank_with_claude(query: str, chunks: List[str], top_k: int = 3) -> List[str]:
    client = anthropic.Anthropic()
    scored_chunks = []
    for chunk in chunks:
        # Ask Claude for a 0-10 relevance score for each candidate chunk
        response = client.messages.create(
            model="claude-3-haiku-20240307",
            max_tokens=10,
            messages=[{
                "role": "user",
                "content": f"On a scale of 0-10, how relevant is this chunk to the query?\n\nQuery: {query}\n\nChunk: {chunk}\n\nRelevance score:"
            }]
        )
        try:
            score = float(response.content[0].text.strip())
        except ValueError:
            score = 0
        scored_chunks.append((score, chunk))
    scored_chunks.sort(reverse=True)
    return [chunk for score, chunk in scored_chunks[:top_k]]
```
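To get value from re-ranking, retrieve more candidates than you ultimately need and let the re-ranker narrow them down. The candidate count of 10 below is an assumed starting point, not a tuned value:

```python
from typing import List

def retrieve_and_rerank(query: str, vector_db: InMemoryVectorDB, final_k: int = 3) -> List[str]:
    # Over-retrieve, then let Claude narrow the candidates down to final_k
    candidates = retrieve(query, vector_db, top_k=10)
    return rerank_with_claude(query, candidates, top_k=final_k)
```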
Results and Performance Gains
Through these targeted improvements, we achieved significant performance gains:
| Metric | Basic RAG | Advanced RAG |
|---|---|---|
| Avg Precision | 0.43 | 0.44 |
| Avg Recall | 0.66 | 0.69 |
| Avg F1 Score | 0.52 | 0.54 |
| Avg MRR | 0.74 | 0.87 |
| End-to-End Accuracy | 71% | 81% |
Best Practices for Production RAG
- Evaluate retrieval and generation separately — This helps you pinpoint where failures occur
- Use high-quality embeddings — Voyage AI and OpenAI embeddings outperform many alternatives
- Implement re-ranking — Even a simple Claude-based re-ranker can boost MRR by 13+ points
- Consider summary indexing — Helps with queries that span multiple chunks
- Monitor rate limits — Full evaluations can hit API limits unless you're on Tier 2+ (a simple backoff sketch follows this list)
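A lightweight way to handle rate limits during a full evaluation run, sketched here as an assumption rather than this guide's actual setup, is to retry with exponential backoff when the API returns a rate-limit error:

```python
import time
import anthropic

def create_with_backoff(client: anthropic.Anthropic, max_retries: int = 5, **request_kwargs):
    """Retry a messages.create call with exponential backoff on rate-limit errors."""
    for attempt in range(max_retries):
        try:
            return client.messages.create(**request_kwargs)
        except anthropic.RateLimitError:
            # Wait 1s, 2s, 4s, ... before retrying
            time.sleep(2 ** attempt)
    raise RuntimeError("Rate limited after repeated retries")
```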
Key Takeaways
- Basic RAG is just the starting point — Naive chunking and cosine similarity alone won't deliver production-quality results. Advanced techniques like summary indexing and re-ranking are essential.
- Evaluate retrieval and end-to-end performance separately — This lets you identify whether failures stem from missing context or poor reasoning, enabling targeted improvements.
- Re-ranking with Claude dramatically improves MRR — Our implementation boosted MRR from 0.74 to 0.87, ensuring the most relevant chunks appear first.
- End-to-end accuracy can improve by 10+ percentage points — Through a combination of summary indexing and re-ranking, we increased accuracy from 71% to 81%.
- Build with a production mindset — Use hosted vector databases, implement caching, and monitor rate limits from the start to avoid painful migrations later.