# Building a Production-Grade RAG System with Claude: From Naive to Optimized
Learn to build, evaluate, and optimize a Retrieval Augmented Generation system with Claude. Covers basic RAG, evaluation metrics, summary indexing, and re-ranking techniques.
This guide walks you through building a RAG system with Claude, from a basic pipeline to advanced optimizations. You'll learn to evaluate retrieval and end-to-end performance using precision, recall, F1, MRR, and accuracy, then improve results with summary indexing and re-ranking.
Retrieval Augmented Generation (RAG) is one of the most powerful patterns for extending Claude's capabilities to your specific business context. Whether you're building a customer support bot, an internal knowledge base Q&A system, or a financial analysis tool, RAG lets Claude answer questions grounded in your own documents.
In this guide, we'll walk through building a RAG system using the Claude Documentation as our knowledge base. We'll start with a basic implementation, then show you how to measure performance properly, and finally apply advanced techniques that improved our end-to-end accuracy from 71% to 81%.
## What You'll Learn
- How to set up a basic RAG pipeline with Claude and Voyage AI embeddings
- How to build a robust evaluation suite with 5 key metrics
- How to implement summary indexing for better retrieval
- How to use Claude as a re-ranker to improve result quality
## Prerequisites
You'll need:
- An Anthropic API key
- A Voyage AI API key
- Python 3.8+ with `anthropic`, `voyageai`, `pandas`, `numpy`, `matplotlib`, and `scikit-learn`
## Level 1: Basic RAG (Naive Approach)
Let's start with the simplest possible RAG implementation. This is often called "Naive RAG" in the industry, and it involves three steps:
- Chunk documents by heading (each subheading becomes a chunk; see the sketch below)
- Embed each chunk using Voyage AI
- Retrieve relevant chunks using cosine similarity
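Step 1 depends on your source format. Here's a minimal sketch of heading-based chunking for markdown files; the function name, the regex, and the `heading`/`source`/`content` fields are illustrative assumptions, not a fixed API:

```python
import re
from typing import Dict, List

def chunk_by_heading(markdown_text: str, source: str) -> List[Dict[str, str]]:
    """Split a markdown document into one chunk per heading (illustrative sketch)."""
    chunks: List[Dict[str, str]] = []
    current_heading = None
    current_lines: List[str] = []

    def flush():
        # Emit the section collected so far as one chunk
        if current_heading is not None:
            chunks.append({
                'heading': current_heading,
                'source': source,
                'content': current_heading + "\n" + "\n".join(current_lines).strip(),
            })

    for line in markdown_text.splitlines():
        if re.match(r"^#{1,6}\s", line):  # any markdown heading starts a new chunk
            flush()
            current_heading = line.strip()
            current_lines = []
        else:
            current_lines.append(line)
    flush()
    return chunks
```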
### Setting Up the Vector Database
For this example, we'll use an in-memory vector database. In production, you'd want a hosted solution like Pinecone, Weaviate, or Chroma.
```python
import voyageai
import numpy as np
from typing import List, Dict

class InMemoryVectorDB:
    def __init__(self, api_key: str):
        self.client = voyageai.Client(api_key=api_key)
        self.documents = []
        self.embeddings = []

    def add_documents(self, documents: List[Dict[str, str]]):
        """Add documents with their embeddings."""
        texts = [doc['content'] for doc in documents]
        embeddings = self.client.embed(texts, model="voyage-2").embeddings
        self.documents.extend(documents)
        self.embeddings.extend(embeddings)

    def search(self, query: str, top_k: int = 3) -> List[Dict]:
        """Search for relevant documents using cosine similarity."""
        query_embedding = self.client.embed([query], model="voyage-2").embeddings[0]
        similarities = [
            np.dot(query_embedding, emb) / (np.linalg.norm(query_embedding) * np.linalg.norm(emb))
            for emb in self.embeddings
        ]
        top_indices = np.argsort(similarities)[-top_k:][::-1]
        return [self.documents[i] for i in top_indices]
```
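A quick usage sketch, assuming the `VOYAGE_API_KEY` environment variable is set and using two placeholder chunks:

```python
import os

db = InMemoryVectorDB(api_key=os.environ["VOYAGE_API_KEY"])
db.add_documents([
    {'heading': '## Rate limits', 'content': '## Rate limits\n...'},
    {'heading': '## Streaming', 'content': '## Streaming\n...'},
])

# Returns the top-2 chunks ranked by cosine similarity to the query embedding
for chunk in db.search("How many requests per minute can I make?", top_k=2):
    print(chunk['heading'])
```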
### Building the RAG Pipeline
Now let's create the full pipeline that retrieves documents and generates answers with Claude:
```python
from anthropic import Anthropic

class BasicRAG:
    def __init__(self, anthropic_key: str, voyage_key: str):
        self.anthropic = Anthropic(api_key=anthropic_key)
        self.vector_db = InMemoryVectorDB(api_key=voyage_key)

    def query(self, question: str) -> str:
        # 1. Retrieve relevant chunks
        chunks = self.vector_db.search(question, top_k=3)

        # 2. Build context from retrieved chunks
        context = "\n\n".join([chunk['content'] for chunk in chunks])

        # 3. Generate answer with Claude
        response = self.anthropic.messages.create(
            model="claude-3-5-sonnet-20241022",
            max_tokens=1024,
            system="You are a helpful assistant. Answer the question based on the provided context.",
            messages=[
                {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"}
            ]
        )
        return response.content[0].text
```
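And a minimal end-to-end sketch; `docs` is assumed to be the list of heading-based chunks prepared earlier, and the API keys are read from the environment:

```python
import os

rag = BasicRAG(
    anthropic_key=os.environ["ANTHROPIC_API_KEY"],
    voyage_key=os.environ["VOYAGE_API_KEY"],
)
rag.vector_db.add_documents(docs)  # docs: heading-based chunks of the knowledge base

print(rag.query("How do I stream a response from Claude?"))
```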
This works, but how well does it actually perform? That's where evaluation comes in.
## Building an Evaluation System
"Vibes-based" evaluation won't cut it for production systems. You need to measure two things independently:
- Retrieval performance – How well does your system find relevant documents?
- End-to-end performance – Does Claude actually answer correctly?
### Creating a Test Dataset
We synthetically generated 100 evaluation samples. Each sample contains:
- A question
- The chunks that are relevant to that question (ground truth)
- A correct answer
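Concretely, a sample can be a small dict; the field names here are assumptions chosen to match the evaluation code below:

```python
# One synthetic evaluation sample (field names and values are illustrative)
eval_sample = {
    'question': "How do I limit the number of output tokens?",
    'relevant_chunks': [
        {'heading': '## Max tokens', 'content': '## Max tokens\n...'},  # ground-truth chunks
    ],
    'correct_answer': "...",  # reference answer used for end-to-end grading
}
```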
### The Five Key Metrics
#### 1. Precision
Precision answers: "Of the chunks we retrieved, how many were actually relevant?"
Precision = True Positives / Total Retrieved
High precision means you're not wasting Claude's context window on irrelevant information.
#### 2. Recall
Recall answers: "Of all the relevant chunks that exist, how many did we retrieve?"
Recall = True Positives / Total Relevant
High recall ensures Claude has all the information it needs.
#### 3. F1 Score
The harmonic mean of precision and recall, giving you a balanced view of retrieval quality.
F1 = 2 × (Precision × Recall) / (Precision + Recall)
#### 4. Mean Reciprocal Rank (MRR)
MRR measures how high the first relevant result appears in your retrieval list. If the first relevant chunk is ranked #1, that's perfect. If it's #3, that's worse.
```python
def calculate_mrr(retrieved_chunks, relevant_chunks):
    for i, chunk in enumerate(retrieved_chunks):
        if chunk in relevant_chunks:
            return 1.0 / (i + 1)
    return 0.0
```
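For example, if the first relevant chunk shows up in position 2, the reciprocal rank is 0.5; averaging reciprocal ranks over the whole dataset gives the MRR:

```python
# Toy check using string IDs instead of chunk dicts
retrieved = ["chunk_b", "chunk_a", "chunk_c"]
relevant = ["chunk_a"]
assert calculate_mrr(retrieved, relevant) == 0.5
```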
#### 5. End-to-End Accuracy
This is the ultimate metric: does Claude's final answer match the expected answer? This requires human or LLM-as-judge evaluation.
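One way to implement the LLM-as-judge option is to ask Claude to grade each answer against the reference; the prompt and the PASS/FAIL convention below are our own assumptions, not a standard API:

```python
def judge_answer(question, generated_answer, correct_answer, anthropic_client):
    """Use Claude as a judge: return True if the generated answer matches the reference."""
    response = anthropic_client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=10,
        system=(
            "You are grading a RAG system. Compare the generated answer to the reference answer. "
            "Reply with exactly PASS if they agree on the substance, otherwise FAIL."
        ),
        messages=[{
            "role": "user",
            "content": (
                f"Question: {question}\n\n"
                f"Reference answer: {correct_answer}\n\n"
                f"Generated answer: {generated_answer}"
            )
        }]
    )
    return response.content[0].text.strip().upper().startswith("PASS")
```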
### Running the Evaluation
```python
def evaluate_retrieval(rag_system, eval_dataset):
    results = []
    for sample in eval_dataset:
        retrieved = rag_system.vector_db.search(sample['question'], top_k=3)
        relevant = sample['relevant_chunks']

        # Count overlap with a membership check (chunk dicts aren't hashable, so no set())
        true_positives = len([chunk for chunk in retrieved if chunk in relevant])

        precision = true_positives / len(retrieved)
        recall = true_positives / len(relevant)
        f1 = 2 * (precision * recall) / (precision + recall) if (precision + recall) > 0 else 0
        mrr = calculate_mrr(retrieved, relevant)

        results.append({
            'precision': precision,
            'recall': recall,
            'f1': f1,
            'mrr': mrr
        })
    return results
```
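To turn the per-sample results into the averages reported below, aggregating with pandas (already in the prerequisites) is enough; `rag` and `eval_dataset` are assumed to be defined as above:

```python
import pandas as pd

results = evaluate_retrieval(rag, eval_dataset)
df = pd.DataFrame(results)
print(df[['precision', 'recall', 'f1', 'mrr']].mean())  # one average per metric
```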
## Level 2: Summary Indexing
Our basic RAG had a problem: chunks were too granular. A chunk about "API Rate Limits" might not contain the phrase "how many requests per minute," even though it's the right answer.
Summary indexing solves this by creating a separate index of chunk summaries. When a query comes in, you search the summary index first, then retrieve the full chunks.

```python
def create_summary_index(documents, anthropic_client):
    summary_index = []
    for doc in documents:
        response = anthropic_client.messages.create(
            model="claude-3-haiku-20240307",
            max_tokens=150,
            system="Summarize this document chunk in 1-2 sentences.",
            messages=[{"role": "user", "content": doc['content']}]
        )
        summary_index.append({
            'summary': response.content[0].text,
            'original': doc
        })
    return summary_index
```
Now when searching, we first find relevant summaries, then retrieve the corresponding full chunks. This improved our recall from 0.66 to 0.69.
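A minimal sketch of that lookup step, assuming the summaries are loaded into a second InMemoryVectorDB whose entries keep a pointer back to the original chunk (the helper names are ours):

```python
def build_summary_db(summary_index, voyage_key):
    """Embed the summaries; each entry keeps a pointer back to its original chunk."""
    db = InMemoryVectorDB(api_key=voyage_key)
    db.add_documents([
        {'content': entry['summary'], 'original': entry['original']}
        for entry in summary_index
    ])
    return db

def search_via_summaries(summary_db, query, top_k=3):
    """Search over summaries, then return the corresponding full chunks."""
    return [hit['original'] for hit in summary_db.search(query, top_k=top_k)]
```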
## Level 3: Summary Indexing + Re-Ranking
The final optimization is re-ranking. After retrieving candidates with cosine similarity, we use Claude to re-rank them based on actual relevance to the query.
```python
def rerank_with_claude(query, candidates, anthropic_client):
    # Ask Claude to score each candidate's relevance
    scores = []
    for chunk in candidates:
        response = anthropic_client.messages.create(
            model="claude-3-haiku-20240307",
            max_tokens=50,
            system="Rate the relevance of this chunk to the query from 0-10. Return only the number.",
            messages=[{
                "role": "user",
                "content": f"Query: {query}\n\nChunk: {chunk['content']}"
            }]
        )
        try:
            score = float(response.content[0].text.strip())
        except ValueError:
            score = 0.0
        scores.append(score)

    # Sort by score descending; the key avoids comparing the chunk dicts themselves
    ranked = [c for _, c in sorted(zip(scores, candidates), key=lambda pair: pair[0], reverse=True)]
    return ranked
```
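In practice you over-retrieve a wider candidate set with embeddings and then let the re-ranker pick the final few. A sketch of that wiring, reusing the `search_via_summaries` helper from the previous section (the candidate counts are illustrative):

```python
def search_with_rerank(summary_db, query, anthropic_client, candidates_k=10, final_k=3):
    """Over-retrieve with embeddings, then let Claude order the candidates."""
    candidates = search_via_summaries(summary_db, query, top_k=candidates_k)
    ranked = rerank_with_claude(query, candidates, anthropic_client)
    return ranked[:final_k]
```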
This dramatically improved our MRR from 0.74 to 0.87, meaning the most relevant chunk almost always appears first.
## Results Summary
Here's what we achieved with each optimization:
| Metric | Basic RAG | + Summary Index | + Re-Ranking |
|---|---|---|---|
| Avg Precision | 0.43 | 0.44 | 0.44 |
| Avg Recall | 0.66 | 0.69 | 0.69 |
| Avg F1 Score | 0.52 | 0.54 | 0.54 |
| Avg MRR | 0.74 | 0.74 | 0.87 |
| End-to-End Accuracy | 71% | 76% | 81% |
## Key Takeaways
- Evaluate retrieval and generation separately. You can't improve what you don't measure. Use precision, recall, F1, and MRR for retrieval, and accuracy or LLM-as-judge for end-to-end performance.
- Summary indexing improves recall. By creating searchable summaries, you help the retrieval system find relevant chunks even when the query doesn't match the exact wording.
- Re-ranking with Claude significantly boosts MRR. Using Claude to re-rank candidates ensures the most relevant information appears first, which improves final answer quality.
- Start simple, then optimize. A basic RAG pipeline works surprisingly well. Only add complexity (summary indexing, re-ranking) when you've measured the baseline and identified specific weaknesses.
- Watch your rate limits. Full evaluations can be token-intensive. Consider sampling your dataset or using a smaller model for intermediate steps.