Building a Production-Grade RAG System with Claude: From Basic to Advanced
Learn to build, evaluate, and optimize a Retrieval Augmented Generation (RAG) system with Claude. Covers basic setup, evaluation metrics, summary indexing, and re-ranking techniques.
This guide walks you through building a RAG system with Claude, from a basic pipeline to advanced techniques like summary indexing and re-ranking. You'll learn to evaluate retrieval and end-to-end performance using precision, recall, F1, MRR, and accuracy metrics, with code examples throughout.
Retrieval Augmented Generation (RAG) is one of the most powerful patterns for grounding Claude in your specific business context. While Claude excels at general knowledge tasks, it can struggle with queries that require access to your internal documentation, customer support articles, or proprietary data. RAG bridges this gap by dynamically retrieving relevant information from your knowledge base and feeding it into Claude's context window.
In this guide, you'll learn how to build, evaluate, and optimize a RAG system using Claude and Voyage AI embeddings. We'll start with a basic "naive" RAG pipeline and progressively enhance it with advanced techniques like summary indexing and re-ranking. By the end, you'll have a practical understanding of how to achieve significant performance gains—we'll show you how to improve end-to-end accuracy from 71% to 81%.
What You'll Need
Before diving in, make sure you have:
- API keys from Anthropic and Voyage AI
- Python libraries: `anthropic`, `voyageai`, `pandas`, `numpy`, `matplotlib`, `scikit-learn`
```python
import anthropic
import voyageai
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity
```
Level 1: Basic RAG Pipeline
Let's start with the simplest approach—often called "Naive RAG." This three-step pipeline is the foundation everything else builds on:
- Chunk documents by heading (each subheading becomes a separate chunk)
- Embed each chunk using Voyage AI
- Retrieve relevant chunks via cosine similarity when a query comes in
Step 1: Initialize Your Vector Store
For this example, we'll use an in-memory vector database. In production, you'd likely use a hosted solution like Pinecone, Weaviate, or Chroma.
```python
class InMemoryVectorDB:
    def __init__(self):
        self.documents = []
        self.embeddings = []

    def add_document(self, doc_id, content, embedding):
        self.documents.append({"id": doc_id, "content": content})
        self.embeddings.append(embedding)

    def search(self, query_embedding, top_k=3):
        # Score the query against every stored embedding and return the top_k documents
        similarities = cosine_similarity([query_embedding], self.embeddings)[0]
        top_indices = np.argsort(similarities)[-top_k:][::-1]
        return [self.documents[i] for i in top_indices]
```
Step 2: Chunk and Embed Your Documents
```python
vo = voyageai.Client(api_key="your-voyage-api-key")

def chunk_by_headings(document_text):
    # Simple splitting by markdown headings
    chunks = []
    current_heading = "Introduction"
    current_content = []
    for line in document_text.split("\n"):
        if line.startswith("##"):
            if current_content:
                chunks.append({"heading": current_heading, "content": "\n".join(current_content)})
            current_heading = line.replace("##", "").strip()
            current_content = []
        else:
            current_content.append(line)
    if current_content:
        chunks.append({"heading": current_heading, "content": "\n".join(current_content)})
    return chunks
```
```python
# Embed each chunk and add it to the vector store
vector_db = InMemoryVectorDB()
chunks = chunk_by_headings(claude_docs)  # claude_docs holds your raw documentation text
for i, chunk in enumerate(chunks):
    embedding = vo.embed([chunk["content"]], model="voyage-2").embeddings[0]
    vector_db.add_document(f"chunk_{i}", chunk["content"], embedding)
```
Step 3: Retrieve and Answer
```python
def basic_rag(query):
    # Embed the query
    query_embedding = vo.embed([query], model="voyage-2").embeddings[0]

    # Retrieve top 3 chunks
    retrieved = vector_db.search(query_embedding, top_k=3)
    context = "\n\n".join([doc["content"] for doc in retrieved])

    # Generate answer with Claude
    client = anthropic.Anthropic(api_key="your-anthropic-api-key")
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": f"Context:\n{context}\n\nQuestion: {query}\n\nAnswer based on the context provided."
        }]
    )
    return response.content[0].text
```
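With those three pieces in place, a query runs end to end in a single call. A quick usage example (the question here is just an illustrative placeholder):

```python
# Ask a question against the indexed docs; any question about your corpus works here
answer = basic_rag("What is the maximum context window size?")
print(answer)
```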
Building an Evaluation System
"Vibes-based" evaluation won't cut it for production RAG. You need to measure two things independently:
- Retrieval performance: How well does your system find the right chunks?
- End-to-end performance: How well does Claude answer questions given those chunks?
The Evaluation Dataset
We'll use a synthetically generated dataset of 100 samples. Each sample contains:
- A question
- Relevant chunks (the ground truth documents that should be retrieved)
- A correct answer
```python
import json

with open("evaluation/docs_evaluation_dataset.json", "r") as f:
    eval_data = json.load(f)

# Preview the first few samples
for sample in eval_data[:3]:
    print(f"Q: {sample['question']}")
    print(f"Relevant chunks: {len(sample['relevant_chunks'])}")
    print(f"Answer: {sample['answer'][:100]}...")
    print("---")
```
Key Metrics Explained
#### Precision
What it measures: Of all the chunks you retrieved, how many were actually relevant?

$$\text{Precision} = \frac{\text{True Positives}}{\text{Total Retrieved}}$$
- High precision = few irrelevant chunks in your results
- Low precision = Claude gets distracting noise
#### Recall
What it measures: Of all the relevant chunks in the knowledge base, how many did you actually retrieve?

$$\text{Recall} = \frac{\text{True Positives}}{\text{Total Relevant}}$$
- High recall = Claude has all the information it needs
- Low recall = Claude might miss critical context
#### F1 Score
What it measures: The harmonic mean of precision and recall, giving a balanced view of retrieval quality.

$$F1 = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$
#### Mean Reciprocal Rank (MRR)
What it measures: How high up in your results does the first relevant chunk appear?

$$\text{MRR} = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{\text{rank}_i}$$
- MRR of 1.0 = first result is always relevant
- Critical for user-facing applications where top results matter most
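To make the formula concrete, here is a small sketch of computing MRR from a list of per-query ranks; the helper name and sample data are illustrative, not part of the original code:

```python
def mean_reciprocal_rank(first_relevant_ranks):
    # first_relevant_ranks[i] is the 1-indexed position of the first relevant
    # chunk for query i (None if no relevant chunk was retrieved at all)
    reciprocal_ranks = [1.0 / r if r is not None else 0.0 for r in first_relevant_ranks]
    return sum(reciprocal_ranks) / len(reciprocal_ranks)

# Three queries: first relevant chunk at rank 1, at rank 2, and never retrieved
print(mean_reciprocal_rank([1, 2, None]))  # (1.0 + 0.5 + 0.0) / 3 = 0.5
```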
#### End-to-End Accuracy
What it measures: The final test: does Claude give the correct answer? This is evaluated by comparing Claude's response to the ground-truth answer.
```python
def evaluate_retrieval(retrieved_chunks, relevant_chunks):
    retrieved_set = set(retrieved_chunks)
    relevant_set = set(relevant_chunks)
    true_positives = len(retrieved_set & relevant_set)
    precision = true_positives / len(retrieved_set) if retrieved_set else 0
    recall = true_positives / len(relevant_set) if relevant_set else 0
    f1 = 2 * (precision * recall) / (precision + recall) if (precision + recall) > 0 else 0
    return {"precision": precision, "recall": recall, "f1": f1}
```
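Retrieval metrics cover only half the picture; end-to-end accuracy still needs a grader. One common approach, sketched below assuming the `client`, `eval_data`, and `basic_rag` objects from earlier, is to have a small Claude model judge each generated answer against the ground truth. The grading prompt here is illustrative, not the exact one behind the reported numbers:

```python
def evaluate_end_to_end(eval_data, rag_fn):
    correct = 0
    for sample in eval_data:
        generated = rag_fn(sample["question"])
        # Ask Claude to judge whether the generated answer matches the ground truth
        grading = client.messages.create(
            model="claude-3-haiku-20240307",
            max_tokens=5,
            messages=[{
                "role": "user",
                "content": (
                    f"Question: {sample['question']}\n\n"
                    f"Ground truth answer: {sample['answer']}\n\n"
                    f"Candidate answer: {generated}\n\n"
                    "Does the candidate answer convey the same information as the ground truth? "
                    "Reply with only 'yes' or 'no'."
                )
            }]
        )
        if grading.content[0].text.strip().lower().startswith("yes"):
            correct += 1
    return correct / len(eval_data)

# accuracy = evaluate_end_to_end(eval_data, basic_rag)
```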
Level 2: Summary Indexing
Basic RAG has a problem: chunks are often too granular. A single chunk might not contain enough context for Claude to understand the full picture. Summary indexing solves this by creating higher-level summaries of document sections.
How It Works
Instead of embedding raw chunks, you:
- Group related chunks under their parent heading
- Generate a summary of each group using Claude
- Embed and index the summaries
- Retrieve summaries, then pass the full chunk content to Claude
```python
def generate_summary(chunks, heading):
    combined = "\n".join([c["content"] for c in chunks])
    response = client.messages.create(
        model="claude-3-haiku-20240307",
        max_tokens=200,
        messages=[{
            "role": "user",
            "content": f"Summarize the following section '{heading}' in 2-3 sentences:\n\n{combined}"
        }]
    )
    return response.content[0].text
```
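The indexing loop below relies on a `group_by_heading` helper that the original snippet doesn't define. A minimal sketch, assuming each chunk dict carries the `heading` key produced by `chunk_by_headings`:

```python
from collections import defaultdict

def group_by_heading(chunks):
    # Collect chunks that share a heading so each section can be summarized as a unit
    groups = defaultdict(list)
    for chunk in chunks:
        groups[chunk["heading"]].append(chunk)
    return groups.items()  # yields (heading, [chunks]) pairs
```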
```python
# Build the summary index
summary_index = {}
for heading, group_chunks in group_by_heading(all_chunks):
    summary = generate_summary(group_chunks, heading)
    summary_embedding = vo.embed([summary], model="voyage-2").embeddings[0]
    summary_index[heading] = {
        "summary": summary,
        "embedding": summary_embedding,
        "chunks": group_chunks
    }
```
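Step 4, retrieving against the summaries and then handing Claude the full underlying chunks, looks roughly like this sketch (the function name is illustrative, and it assumes the `vo` client and `summary_index` built above):

```python
def retrieve_via_summaries(query, top_k=3):
    query_embedding = vo.embed([query], model="voyage-2").embeddings[0]
    # Score the query against each section summary
    scored = []
    for heading, entry in summary_index.items():
        similarity = cosine_similarity([query_embedding], [entry["embedding"]])[0][0]
        scored.append((similarity, entry))
    scored.sort(reverse=True, key=lambda x: x[0])
    # Return the full chunks under the best-matching summaries, not the summaries themselves
    results = []
    for _, entry in scored[:top_k]:
        results.extend(entry["chunks"])
    return results
```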
Why It Works
Summaries capture the essence of a section, making retrieval more accurate. When a query matches a summary, you retrieve all chunks under that heading—giving Claude richer context.
Result: Average recall improved from 0.66 to 0.69 in our tests.
Level 3: Summary Indexing + Re-Ranking
Re-ranking is the secret weapon for production RAG. After initial retrieval, you use Claude to re-rank the results based on relevance to the specific query.
The Re-Ranking Workflow
- Retrieve top 10 candidates using summary indexing
- For each candidate, ask Claude to score relevance (1-5) to the query
- Sort by score and take the top 3
```python
def rerank_with_claude(query, candidates, top_k=3):
    scored = []
    for candidate in candidates:
        response = client.messages.create(
            model="claude-3-haiku-20240307",
            max_tokens=10,
            messages=[{
                "role": "user",
                "content": f"On a scale of 1-5, how relevant is this document to the question?\n\nQuestion: {query}\n\nDocument: {candidate['content'][:500]}\n\nAnswer with only a number."
            }]
        )
        score = int(response.content[0].text.strip())
        scored.append((score, candidate))
    # Sort by score descending and take top_k
    scored.sort(reverse=True, key=lambda x: x[0])
    return [candidate for _, candidate in scored[:top_k]]
```
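Wired together with the summary-index retrieval from Level 2, the full Level 3 query path looks roughly like the sketch below. It reuses the illustrative `retrieve_via_summaries` helper from earlier and assumes the same `client`:

```python
def advanced_rag(query):
    # Cast a wide net first, then let Claude narrow it down
    candidates = retrieve_via_summaries(query, top_k=10)
    top_chunks = rerank_with_claude(query, candidates, top_k=3)
    context = "\n\n".join([doc["content"] for doc in top_chunks])
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": f"Context:\n{context}\n\nQuestion: {query}\n\nAnswer based on the context provided."
        }]
    )
    return response.content[0].text
```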
Why Re-Ranking Matters
Re-ranking dramatically improves Mean Reciprocal Rank (MRR)—the first retrieved chunk is far more likely to be relevant. In our tests, MRR jumped from 0.74 to 0.87.
Putting It All Together: Performance Gains
Here's what we achieved by layering these techniques:
| Metric | Basic RAG | Summary Indexing | + Re-Ranking |
|---|---|---|---|
| Avg Precision | 0.43 | 0.44 | 0.44 |
| Avg Recall | 0.66 | 0.69 | 0.69 |
| Avg F1 Score | 0.52 | 0.54 | 0.54 |
| Avg MRR | 0.74 | 0.80 | 0.87 |
| End-to-End Accuracy | 71% | 76% | 81% |
Production Considerations
- Rate limits: Full evaluation runs can hit API rate limits. Consider using Tier 2+ accounts or running evaluations incrementally.
- Token costs: Summary indexing and re-ranking add token usage. Balance quality gains against cost.
- Vector database: For production, use a hosted vector DB with built-in indexing and filtering.
- Chunking strategy: Experiment with different chunk sizes and overlap. There's no one-size-fits-all.
Key Takeaways
- Evaluate retrieval and generation separately to pinpoint where your RAG system needs improvement. Use precision, recall, F1, and MRR for retrieval; accuracy for end-to-end.
- Summary indexing bridges the granularity gap by grouping related chunks under high-level summaries, improving recall without sacrificing precision.
- Re-ranking with Claude is a powerful but often overlooked technique that significantly boosts MRR and end-to-end accuracy.
- Start simple, then iterate—a basic RAG pipeline can be surprisingly effective. Add complexity (summary indexing, re-ranking) only when you have data proving the need.
- Your evaluation dataset is your compass—invest time in creating a high-quality, representative set of questions and ground-truth answers. It will guide every optimization decision.